UNIVERSITY MATHEMATICAL TEXTS

General Editors: Alexander C. Aitken, D.Sc., F.R.S., and Daniel E. Rutherford, D.Sc., Dr.Math.
Assistant Editor: Iain T. Adamson, Ph.D.

29

NUMERICAL METHODS: 1
ITERATION, PROGRAMMING AND ALGEBRAIC EQUATIONS

BEN NOBLE, M.A., B.Sc., D.Sc., A.M.I.E.E.
The Mathematics Research Center, U.S. Army, The University of Wisconsin, Madison, Wisconsin, U.S.A.
(Formerly of the Department of Mathematics, The Royal College of Science and Technology, Glasgow, Scotland)

OLIVER AND BOYD LTD
Tweeddale Court, Edinburgh 1
39a Welbeck Street, London W.1
New York: Interscience Publishers Inc.

First published 1964; reprinted 1966
© 1964 B. Noble
Printed in Great Britain by Oliver and Boyd Ltd., Edinburgh

PREFACE

It is desirable that all students of applied mathematics, the physical sciences, and engineering should be given an introduction to numerical analysis. The aim of this book is to provide, for this purpose, an elementary exposition of certain basic ideas. This has been done by considering a strictly limited selection of typical methods with particular emphasis on fundamentals, including error analysis, error estimation, pitfalls, and the detection of mistakes. An effort has been made to avoid the common practice of regarding numerical methods as a collection of algorithms or recipes. Rather than use a given amount of space to describe several different methods for solving the same type of problem, it has been thought preferable to illustrate general principles by discussing one single method in some detail. Numerical analysis should be a source of useful tools but it should also make the student think.
The advent of automatic computers has produced a revolution in computing. For this reason programming for computers has been made an integral part of the exposition. Although it is essential to teach the basic ideas of programming, the author believes that the time spent on the technical details in an elementary course should be cut to a minimum. There are advantages in teaching a simple general form of automatic programming or autocode of the type developed in Chapter III, which need occupy no more than one or two lectures. The system is not applicable to any particular machine but it contains many of the features common to autocodes. If it is preferred to teach a programming system used by an existing machine, the programs in this book can easily be converted into the appropriate autocode. (The word "program" is used in the technical sense of a set of instructions specifying a procedure or programme for solving a problem on a real or hypothetical automatic computer.)

An automatic computer is indispensable for any institution concerned with the advanced teaching of scientists, but the place which should be occupied by computers in teaching elementary numerical analysis is likely to remain the subject of debate for many years. One can visualise courses in which all the practical work is done on automatic computers, and equally successful courses in which only hand machines are used. It is instructive and stimulating for a student to see a computer in operation but he need not become a programmer to learn about numerical analysis. Hand computing and automatic computing may require different habits of thought, but the basic principles are common and these can almost always be illustrated by simple examples. Difficulties in large-scale calculations are not usually due to the emergence of some radically new kind of error. The main point is that all students should have first-hand experience of numerical calculation, however this is obtained.
There is usually a "correct" answer to a problem in numerical mathematics in a precisely defined sense. The onus is on the person specifying the calculating procedure to ensure that the information obtained during the computation will indicate in what sense, if any, the calculation has resulted in an approximation to the "correct" answer. It may help to impress this upon the student if he sees an automatic computer produce, without warning, numbers which are meaningless and wrong, as the result of a program which he himself has written.

When all computations have to be performed by hand, the student tends to devote a major part of his effort to the technical details of numerical procedures. When the attitude is adopted that most calculations will be performed by computer, the emphasis in introductory courses can be shifted from the details of methods to the theory behind classes of methods. Though the student should have some acquaintance with the details of typical procedures, there is little point in teaching complicated algorithms to students who will normally meet such algorithms only in standard computer programs devised by experts. Thus the author would not normally teach Bairstow's method for finding the complex roots of polynomials (§ 2.7) in an elementary course.

Many examination questions on numerical methods are designed to test powers of regurgitation rather than understanding. The standard questions beginning "Describe a method for . . ." and "Find the solution of . . ." should not exclude other types of question, for example: "The calculations described below gave the following results. . . . Find out whether the required solution of the problem can be deduced from these results. In particular estimate the effect of truncation error (or round-off error, or instability of the method, or ill-conditioning of the original problem). Suggest methods for checking your conclusions."
It seems to be common practice to teach numerical analysis and basic mathematics in separate courses, and to postpone numerical analysis to a late stage — even a postgraduate stage — in a student's career. The author believes that this is wrong in principle and that much of the material in this book should ultimately be integrated into courses on basic mathematics for applied scientists. The material should be used to illustrate the mathematics, and to provide motivation and stimulation for more mathematics.

Operator methods for the derivation of finite-difference formulae have been omitted completely. Experience in teaching indicates that students quickly acquire facility in obtaining complicated formulae by operators without having much idea of the limitations or implications of their results. It seems preferable in an introductory course to concentrate attention on the basic underlying principle, namely that functions are being represented by polynomials through equally spaced points. Other types of functional representation can be introduced as natural alternatives.

Most error estimates have been obtained in the form of a dominant term followed by a series of dots. Little attempt has been made to obtain closed error estimates by using a Taylor series with a remainder term, or mean-value theorems, since at an elementary level this complicates the analysis without contributing materially to the understanding of either the method or the results. Questions of elegance and rigour enter into the discussion of a numerical method but this book is not aimed at the pure mathematician. In some ways numerical analysis should be approached as if it were an experimental science. An exact answer exists for any given problem, and the numerical analyst attempts to determine this exact answer, to a specified degree of accuracy, by means of suitably designed experiments. Mathematics is the tool, but not the aim or object, of the science.
Short chapters are included on eigenvalues and partial differential equations. These are topics which are normally regarded as "advanced", but from a numerical point of view they can be treated in almost as elementary a fashion as the solution of algebraic equations, or ordinary differential equations.

For further reading it is recommended that the student start with Modern Computing Methods, H.M. Stationery Office, 2nd Edn. (1961), which both supplements and complements the present text, and includes a list of papers and books. In view of the comprehensive bibliography in Modern Computing Methods it has seemed unnecessary to include more than a few (non-systematic) references in the present text.

Unless otherwise indicated the numerical examples have been worked on a hand-machine, and as a general rule one extra decimal place has been carried in the intermediate working. When the student checks numerical examples his results may of course differ slightly from those in the text in the last significant digit. Many of the examples at the ends of chapters are intended to supplement the text, not to provide additional exercises.

The origins of this book can be traced back to the lectures of the late D. R. Hartree, and a year spent in the Mathematical Laboratory, Cambridge, under M. V. Wilkes in 1947-48 during the exciting early days of EDSAC. Even though this experience failed to convert me into a professional computer, it taught me that numerical analysis is much more than an exercise in pure mathematics or a collection of recipes.

The preparation of the manuscript of this book was a time-consuming occupation. I feel indebted to the mathematics department of the future second university in Glasgow, to my wife, and to all correspondents who have had to wait for answers to letters. They had much to endure while the second draft of this book was being typed with one finger. It is a pleasure to acknowledge help from many friends.
The ideas in Chapter III on programming for automatic computers originated from using the system known as FORTRAN (FORmula TRANslation) on the IBM 704 at the New York University AEC Computing Facility. I am grateful to E. Isaacson for encouragement to use this machine and to Max Goldstein for indispensable advice. The work was made possible by contract AF 19(604)5238 of the Electronics Directorate of the U.S. Air Force Cambridge Research Centre, which financed a visit to the Division of Electromagnetic Research, the Institute of Mathematical Sciences, New York University. A first draft of Chapter III was considerably altered after correspondence with T. Brooker and S. Gill, and a subsequent meeting with members of the committee on automatic programming of the British Computer Society, chairman C. Strachey. A remark of T. Brooker changed the emphasis of Chapter II from "solution of equations" to "iterative procedures". D. J. Wheeler introduced me to the idea of the "noise-level" of an iterative calculation. Various comments in the text on the representation of functions by polynomials have been modified as a result of criticisms by G. F. Miller. Several points have been clarified by discussions with A. J. Howie, G. N. Lance, A. van Wijngaarden and J. H. Wilkinson. Many of the numerical examples have been checked by A. Paton and M. V. Sweet. Numerical results have been contributed by: J. G. Fraser and M. V. Sweet (Ex. 10.23), A. Paton (Table 4.1), M. Walker (Ex. 11.8), F. J. Warner (Ex. 10.17), S. B. L. Wilson (the table (11.26)), C. S. Wu (Ex. 10.22). A. C. Aitken, J. G. Fraser and M. V. Sweet have spent much time in reading the manuscript in painstaking detail. Their valuable criticisms have increased the accuracy and readability of the text and the examples. G. Fairweather, R. C. Peebles and B. Shaw have assisted in proofreading and contributed helpful comments. D.
Greenspan has suggested improvements and detected inaccuracies in both manuscript and proofs. I am indebted to the editors, A. C. Aitken and D. E. Rutherford, for advice and recommendations. Above all I am grateful to Miss A. Paton. In a period of numerous distractions her assistance ensured a regular weekly production of manuscript. It is certain that without her help this book would never have been written.

B. NOBLE
Mill Cottage, Broughton Mills, Lancs.
July, 1962.

CONTENTS

Preface

Volume I — ITERATION, PROGRAMMING AND ALGEBRAIC EQUATIONS

I. Accuracy and Error
1.1 Introduction
1.2 Rounding off
1.3 Absolute and relative errors
1.4 Error analysis and control
1.5 The evaluation of formulae on desk machines
1.6 Mistakes
Examples I

II. Iterative Methods, with Applications to the Solution of Equations
2.1 Introduction
2.2 A simple iterative method
2.3 The Newton-Raphson iterative method
2.4 General aspects of iterative procedures
2.5 Real roots of polynomials
2.6 Errors when finding roots of polynomials
2.7 Bairstow's method for finding complex roots of polynomials
Examples II

III. Elementary Programming for Automatic Computers
3.1 Introduction
3.2 Simple programs
3.3 Some programs involving iterative procedures
3.4 General comments
Examples III

IV. Simultaneous Linear Algebraic Equations
4.1 Introduction
4.2 The method of successive elimination
4.3 Choice of pivots and scaling
4.4 Inherent error and ill-conditioned equations
4.5 A computer program for the method of successive elimination
Examples IV

V. Matrix Methods
5.1 Matrix algebra
5.2 A compact elimination method for the solution of linear equations
5.3 The inverse matrix
Examples V

VI.
Eigenvalues and Eigenvectors
6.1 Introduction
6.2 An iterative method for finding the largest eigenvalue
6.3 The determination of subdominant eigenvalues and eigenvectors
6.4 The iterative solution of linear simultaneous algebraic equations
Examples VI

Index

Volume II — DIFFERENCES, INTEGRATION AND DIFFERENTIAL EQUATIONS

VII. Finite Differences and the Approximate Representation of Functions
VIII. Polynomial Interpolation
IX. Numerical Integration and Differentiation
X. Ordinary Differential Equations
XI. Partial Differential Equations

CHAPTER I

ACCURACY AND ERROR

§ 1.1. Introduction. The solution of a problem in applied mathematics usually consists of numbers which satisfy some kind of equation. In theory these numbers may be specified exactly by the equation but even in the simplest cases (of which x = √2 can be taken as a trivial example) it is normally impossible to write down an exact decimal representation of the solution. The aim of numerical methods is to provide practical procedures for calculating the solutions of problems in applied mathematics to a specified degree of accuracy.

We shall be constantly preoccupied with the estimation and prevention of errors. This is merely recognition of the fact that to produce a number which is the answer to a problem is only half the aim of numerical methods. The other half, and often the more difficult half, is to ensure that the answer is correct within stated limits. Any numerical answer should be accompanied by a statement of its maximum error or its probable error.

Errors occur in computation in four ways:
(a) Round-off errors due to the need to represent numbers in decimal notation, using a finite number of decimal places.
(b) Mistakes.
(c) Errors due to the nature of the formulae used, e.g. the need to represent a continuous function by a table of numbers.
(d) Some of the numbers which must be used in the computation may have inherent error, or may be known only approximately, e.g. physical data.

It is a most important principle in numerical work that all sources of error must be borne constantly in mind.

§ 1.2. Rounding off. The numbers 0.745 25 and 0.007 452 5 are given to five and seven decimal places respectively. The number of significant figures is defined as the number of digits that are assumed to be correct, starting at the left with the first non-zero digit, and proceeding to the right. Thus if all the non-zero digits in the above numbers are correct, they both possess five significant figures. Zeros to the left of the first non-zero digit are not significant and serve merely to mark the decimal point. Zeros to the right of the last non-zero digit are regarded as significant. Thus 0.4300 possesses four significant figures. Ambiguity occurs with a number like 4300. This can be written 4.300 × 10³ if it is desired to indicate that it is correct to four significant figures. 4.30 × 10³ is correct to three significant figures.

If only certain digits in a number are correct the superfluous digits can be discarded by rounding off. The rule for doing this is as follows. To round off a number to n significant figures retain n significant digits and discard the remainder. If the discarded number is less than half a unit in the nth place leave the nth digit unchanged; if it is greater than half a unit in the nth place add unity to the nth digit; if it is exactly half a unit round off to the nearest even digit. The object of the last part of the rule is to ensure that when exactly half a digit is discarded, numbers will be rounded up and rounded down in equal proportions, on the average. The only reason for rounding to even rather than to odd digits is that the resulting number is divisible by two without ambiguity.
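The round-half-to-even rule can be made concrete with a short sketch (not from the text; the helper `round_sig` and its use of Python's decimal module are illustrative assumptions — Python's built-in `round` also uses this rule for floats):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(x: str, n: int) -> Decimal:
    """Round a decimal number (given as a string, to avoid binary
    representation error) to n significant figures, using the
    round-half-to-even rule described in the text."""
    d = Decimal(x)
    if d == 0:
        return d
    # d.adjusted() is the exponent of the leading significant digit,
    # so this quantum marks the nth significant place.
    quantum = Decimal(1).scaleb(d.adjusted() - n + 1)
    return d.quantize(quantum, rounding=ROUND_HALF_EVEN)

print(round_sig("7.4525", 4))  # 7.452  (exactly half: round to even)
print(round_sig("7.4525", 3))  # 7.45
print(round_sig("7.4525", 2))  # 7.5
print(round_sig("7.45", 2))    # 7.4    (half again: 4 is already even)
```

These reproduce the worked figures that follow in the text, including the caution that rounding 7.45 directly to two figures gives 7.4.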
If 7.4525 is rounded off to 4, 3, 2 figures we obtain 7.452, 7.45, 7.5. But note that if 7.45 is rounded off to two significant figures, without further information, we obtain 7.4.

§ 1.3. Absolute and relative errors. If n is an approximation to a number N which is defined exactly, the error ε of n is defined by N = n + ε. (Some writers use the definition n = N + ε but the distinction is not important though it is better to be consistent.) The absolute error is |ε| and the (absolute) relative error is |ε|/|N|. We shall be interested in comparatively rough estimates of the error and we assume that the relative error is small. In this case the relative error can be taken equal to |ε|/|n| which has the advantage that the denominator is known. It is often possible to establish an upper limit for the magnitude of the error. For example if a number is given to exactly m decimal places then |ε| ≤ 0.5 × 10⁻ᵐ. We call this upper limit for the magnitude of the error the maximum error and denote it by e. Thus |ε| ≤ e. The relative error is less than e/|N| ≈ e/|n| (assuming that e is much less than |n|).

Let n₁, n₂ be approximations to N₁, N₂ with errors ε₁, ε₂, so that

N₁ = n₁ + ε₁,  N₂ = n₂ + ε₂.

Consider first addition and subtraction:

N₁ + N₂ = (n₁ + n₂) + (ε₁ + ε₂),  N₁ − N₂ = (n₁ − n₂) + (ε₁ − ε₂).

If |ε₁| ≤ e₁, |ε₂| ≤ e₂, then

|ε₁ + ε₂| ≤ e₁ + e₂,  |ε₁ − ε₂| ≤ e₁ + e₂.

Hence if two numbers are added or subtracted, the maximum error in the sum or difference is the sum of the maximum errors in the two numbers. For multiplication we have

N₁N₂ = n₁n₂ + n₂ε₁ + n₁ε₂ + ε₁ε₂.

If we denote n₁n₂ by n, and the error in n by ε,

ε = n₂ε₁ + n₁ε₂,  i.e.  (ε/n) = (ε₁/n₁) + (ε₂/n₂),

where a second-order term ε₁ε₂ has been neglected. Hence the relative error in a product is the sum of the relative errors in the two numbers which have been multiplied together.
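The addition and multiplication rules can be checked numerically; the following sketch uses illustrative values (not from the text), with ε defined as in N = n + ε:

```python
# n1, n2 approximate N1, N2 with errors eps1, eps2 (illustrative values).
N1, N2 = 3.14159, 2.71828
n1, n2 = 3.142, 2.718          # rounded approximations
eps1, eps2 = N1 - n1, N2 - n2  # errors as defined by N = n + eps

# Sum: the error of the sum is exactly the sum of the errors, so its
# magnitude is bounded by the sum of the maximum errors.
assert abs((N1 + N2) - (n1 + n2) - (eps1 + eps2)) < 1e-12

# Product: the relative error of the product is close to the sum of
# the relative errors; they differ only by the neglected second-order
# term eps1*eps2/(n1*n2).
rel_prod = (N1 * N2 - n1 * n2) / (n1 * n2)
rel_sum = eps1 / n1 + eps2 / n2
print(rel_prod, rel_sum)  # agree to second order
```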
In a similar way, for division,

N₁/N₂ = (n₁ + ε₁)/(n₂ + ε₂) ≈ (1/n₂)(n₁ + ε₁)(1 − ε₂/n₂) ≈ (n₁/n₂) + (ε₁/n₂) − (n₁ε₂/n₂²),

where we have used the binomial theorem and neglected second-order terms. Introduce n = n₁/n₂ with error ε. Then

(ε/n) = (ε₁/n₁) − (ε₂/n₂).

We deduce from these results that for both multiplication and division the maximum relative error in the result is the sum of the maximum relative errors of the given numbers.

Absolute error is connected with decimal places and is important when adding and subtracting. Relative error is connected with significant figures and is important when multiplying and dividing. We consider some numerical examples which illustrate and extend these remarks. It will be assumed that the numbers with which we start are correct to the last digit stated. Then a number with approximate representation 3.24, say, must lie between 3.235 00 . . . and 3.244 99 . . . . A convenient way of expressing this roughly is to say "3.24 is correct to 1 in 600".

The sum 49.84 + 0.15 = 49.99 is correct to 1 in 5000 even though one of the numbers is correct to only 1 in 30. The sum 0.4984 + 15 = 15.4984 is correct to only 1 in 30. The difference 49.99 − 49.84 = 0.15 is correct to only 1 in 15 although the original numbers are both correct to 1 in 10,000. The loss of significant figures in this way is one of the most insidious sources of error in numerical work. It is often possible to avoid errors due to the subtraction of two nearly equal numbers by transforming the formulae used in the computation, as illustrated in Exs. 1.1, 1.2 below.

The sum S = 15.9 + 13.444 + 10.7246 = 40.0686 is accurate only to ±0.05. We can discard the last two digits in the answer, writing S = 40.07 ± 0.05. It would be slightly misleading to write the answer as 40.1, but this is of course permissible if it is recognised that the last digit may be nearly one unit in error.
Alternatively one could write S = 40.0₇ where the notation indicates that the number is accurate to within a few units in the depressed digit. We could save ourselves labour if we rounded off to two decimal places before adding:

S ≈ 15.9 + 13.44 + 10.72 = 40.06.

This is not strictly speaking correct since the sum is in fact 40.07, but a unit in the second decimal place is not important when we know that the result may be in error by five units in this place. On the other hand it is not permissible to write S = 15.9 + 13.4 + 10.7 = 40.0. The errors we have introduced by rounding off are now comparable with the inherent error already present so that we have unnecessarily increased the error. The moral is that if we wish to round off numbers before a series of additions and subtractions, and the least accurate number is accurate to n decimal places, we should round off to n + r decimal places where r is a suitable positive integer usually chosen to be 1 or 2.

When multiplying or dividing, if the numbers have nearly the same relative error the relative error in the result is approximately twice the relative error in the individual numbers. If one number is more accurate than the other, the relative error is approximately the same as the relative error in the less accurate number. Thus 3.34² = 11.1556 is accurate to 1 in 300, and 3.34 × 3.335 = 11.1389 is accurate to 1 in 600. Although the less accurate number in each of these examples is given to three significant figures it would be misleading and inaccurate to round off the answers to three significant figures.

§ 1.4. Error analysis and control. When using a desk calculating machine, round-off errors are under the direct control of the person doing the computation. It would be tiresome to have to round off every single number in a calculation to the appropriate number of significant figures.
On the other hand it is essential to employ round-off to save unnecessary labour and keep the lengths of numbers within reasonable limits. One difficulty is illustrated by the calculation 0.014 63 ÷ 79.3 = 0.000 184 5. To obtain an answer correct to 1 in 1000 it is necessary to carry seven decimal places in the answer. On the other hand if we can scale our variables so that the calculation reads 1.463 ÷ 0.793 = 1.845 we need carry only three decimals. It is desirable to arrange, by scaling or otherwise, that in the body of the calculation all numbers are rounded to n + 1 or n + 2 decimal places and that this then gives an answer correct to n decimals. A more elaborate example of scaling is given in § 4.1, Ex. 4.1.

A similar situation can arise in connection with automatic digital computers. These machines perform essentially the same arithmetic operations as a desk calculator — they add, subtract, multiply and divide numbers stored inside the machine (see § 3.1). Analysis of round-off errors will depend on the way in which numbers are stored. If a fixed-point system is used this means that numbers are represented with a fixed number of decimal places, e.g. 0.0015, 793.1482. If a floating-point system is used numbers are represented with a fixed number of significant figures, e.g. 1.463 × 10⁻³, 7.931 × 10², where we have placed the decimal point immediately after the first significant digit. The scaling difficulty mentioned in the last paragraph may be particularly important if calculations have to be performed in a fixed-point system.

In hand computing it is possible to vary the number of decimal places being carried in the light of the results obtained as the calculation proceeds in order to obtain an answer of the required accuracy with the minimum of labour. In automatic computing the number of decimals depends on the design of the machine and it may be difficult to predict beforehand the course of a calculation.
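The benefit of scaling in fixed-point working can be sketched by rounding every quantity to a fixed number of decimal places, as a fixed-point machine would; the four-place working below is an assumption chosen for illustration, not from the text:

```python
def fixed(x, places=4):
    """Imitate fixed-point working: every number carries a fixed
    number of decimal places (here four, an illustrative choice)."""
    return round(x, places)

# Unscaled: 0.01463 / 79.3 = 0.0001845...; with only four decimal
# places the quotient retains just one significant figure.
q1 = fixed(fixed(0.01463) / fixed(79.3))
# Scaled by powers of ten first: 1.463 / 0.793 = 1.845...; the same
# four decimal places now give four significant figures.
q2 = fixed(fixed(1.463) / fixed(0.793))
print(q1, q2)   # 0.0002 versus 1.8449
```

The scaled quotient agrees with the true value to about 1 in 18,000, while the unscaled one is in error by roughly eight per cent.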
It is often easier to make the machine compute as accurate an answer as possible, subject to round-off and other errors, and then determine the accuracy that has been achieved, rather than try to produce an answer of specified accuracy.

It is important to realise that calculations can often be arranged so that the effect of round-off errors is minimised. This is illustrated in the following two examples.

Ex. 1.1. Find the roots of x² − 26x + 1 = 0.

We have x = 13 ± √168 = 13 ± 12.961. Hence the roots are x₁ = 25.961, x₂ = 0.039. The second root has been obtained to only two significant figures even though the square root was calculated to five figures. However, instead of calculating the second root in this way we can use the relation x₁x₂ = 1, so that x₂ = 1/25.961 = 0.038 519, in error by at most one unit in the last digit.

Ex. 1.2. Evaluate

c = (bA − aB)/(b − a),   (1.1)

with b = 0.485, a = 0.231, B = 6.327 19, A = 6.322 81, where the data are accurate only to the number of figures stated.

If we use the formula for c as it stands, it might seem that the result will be accurate to only three significant figures since the denominator is accurate to only 1 in 250.
Then on neglecting second order quantities it is found that bA-aB b-a 6c = (b + 5b)(A + 5A)- (a+5a)( B+ SB) (b-a) + (db-5a) *{(B- A)(aSb - bSa) + (/> - a)(.l>5A -aS B)}(b -a)' 2 . Terms involving A and B have cancelled to give a final error which depends only on (B- A), SA, and 5B, all of which are small. Substitution in this formula again gives the maximum possible error as two units in the fifth decimal place. Formula (1.2) is more convenient than (1.1) since the error can be estimated more directly from it, and since fewer decimal places need be carried in the computation to obtain a given accuracy in the final answer. This is an important example of the way in which a formula can be rearranged to give a more suitable formula for computation. Error analysis in terms of maximum error, as discussed in the last section, is useful in avoiding particular sources of error, but in a long calculation an error analysis worked §1.4 ACCURACY AND ERROR out on the basis of the maximum possible error will over- estimate the errors which arc likely to occur in practice by a very large factor. For instance if the maximum absolute error of a single result is e then the maximum possible error that can be obtained when adding n numbers is ne, though this is very unlikely to occur in practice. In fact if errors are randomly distributed between ±e, there is only one chance in 20 that the error in the sum of n numbers will exceed 2(/j/3)*e, assuming that the errors in the n numbers are independent. There is only one chance in 370 that the error will exceed 3(/;/3)*<>, and this number is much less than ne if/; is large. More precisely, the following results are given by statistical theory.f The details lie outside the scope of this book. 
† A. C. Aitken, Statistical Mathematics, Oliver and Boyd (1942), pp. 35, 62.

If errors are distributed around zero according to a frequency distribution y = f(x), the variance of the error in a single number is defined to be

σ² = ∫₋∞^∞ x²f(x) dx,  where  ∫₋∞^∞ f(x) dx = 1.

The variance of the error in the sum of n numbers is nσ² and the standard deviation of the error of the sum is the square root of the variance, or n^½ σ. The distribution of the sum tends to the normal or Gaussian form as n increases. Thus there is only one chance in 20 of obtaining an error of more than twice the standard deviation. As an application of these results suppose that errors are uniformly distributed between ±0.5, so that f(x) = 1 for −0.5 < x < 0.5 and f(x) = 0 for |x| > 0.5. Then

σ² = ∫₋₀.₅^⁰·⁵ x² dx = 1/12.

The standard deviation of the error of the sum of n numbers is therefore (n/12)^½. On adding 100 numbers the standard deviation is approximately 3 and if the errors are randomly distributed it is unlikely that the total error will be more than ±6. It is very unlikely that the error will be more than ±10. In contrast, the maximum possible error is ±50.

Statistical considerations are particularly important in connection with automatic computers, where calculations may involve an extremely large number of operations. From an analogy with noise in an electrical circuit, round-off errors which occur at random can be regarded as "noise" superimposed on an exact calculation. Some interesting comments on round-off noise are given in R. W. Hamming, Numerical Methods for Scientists and Engineers, McGraw-Hill (1962), Chapter II. It is also pointed out in this chapter that in many circumstances the size of the leading (i.e. most significant) digit in the numbers of a calculation may tend to be small. Thus if two numbers are multiplied together, each being chosen at random from 1, 2, . . .
9, then by examining the leading digits in the 81 possible products it is easily seen that the product will probably have a small leading digit.

A theoretical analysis of error, statistical or otherwise, is invaluable when comparing the relative efficiencies of various methods for solving a given type of problem. Even when it is possible to estimate errors within useful limits, however, we are usually interested in the actual error in the answer to a specific concrete example rather than the average or maximum error when the method is applied to a class of examples. For this reason we shall not devote much space in this book to the determination of maximum or average errors in computing procedures. Instead the following approach is adopted. After a general analysis of the sources of error in a given method, the numerical procedure is arranged so that an estimate of the accuracy of the answer can be obtained empirically from the results obtained during the course of the calculation. Some of the ways in which this can be done will be explained later in connection with particular examples, but the general principle is stated here to emphasise that, from the point of view adopted in this book, the object of theoretical error analysis is practical error control.

§ 1.5. The evaluation of formulae on desk machines. It will be assumed in this section that the reader is familiar with the operation of an ordinary desk calculator possessing a setting register, a multiplier register, and an accumulator. For example, in forming the product ab the number a is inserted in the setting register. When this number is multiplied by b, the number b appears in the multiplier register and the product ab in the accumulator. Without going into detail we emphasise briefly several aspects of the use of such machines, with particular reference to the way in which good machine technique can prevent the occurrence of mistakes.
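Before turning to machine technique, the remark in § 1.4 about leading digits can be confirmed by enumerating the 81 products directly; a minimal sketch:

```python
from collections import Counter

# Leading digit of every product a*b with a and b each drawn from 1..9.
counts = Counter(int(str(a * b)[0]) for a in range(1, 10) for b in range(1, 10))
small = sum(counts[d] for d in (1, 2, 3))

print(counts[1])   # 18 of the 81 products have leading digit 1
print(small)       # 44 of the 81 have leading digit 1, 2 or 3
```

More than half of the products begin with 1, 2 or 3, and 1 is the most frequent leading digit, confirming that products of random digits tend to have small leading digits.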
An elementary example occurs in connection with reading negative numbers off the machine. Negative numbers appear in the accumulator as complements; for example 99 ... 994 732 represents the number -5268. In reading such a number off the machine, the safe procedure is to set the complement, calculated mentally (namely 5268), in the setting register; add into the accumulator; check that the accumulator then reads zero; and record the number in the setting register. Another example occurs in connection with transferring numbers from the accumulator to the setting register. The number which is entered in the setting register should be checked in an obvious way by subtracting it from the contents of the accumulator.

On the whole it is a good idea to do as much work as possible on the machine. For instance unnecessary recording of numbers should be avoided. Thus a × b × c and a × b + c can be calculated without recording intermediate numbers. Similarly expressions like

a₁b₁ + a₂b₂ + a₃b₃ + ...

can be calculated without intermediate recording. In this type of calculation (as in any calculation involving a repetitive procedure) it is essential to fix the positions of the decimal points in the setting register, the multiplier register, and the accumulator, subject of course to the condition that 10^a × 10^b = 10^(a+b).

Care has to be exercised when transferring numbers from the work sheet to a calculating machine or vice-versa, or from one part of the sheet to another. Common errors occur from transposing digits (e.g. 8452 may be transferred as 8542), or from repeated digits (e.g. 8455 may be transferred as 8445). Wherever possible the work should be reduced to a routine. Tabulation is a great help in this respect. It is usually best to arrange a tabulation so as to do only one thing at a time. Differencing procedures are often useful in checking (§ 7.3).

Ex. 1.3.
Find f(1.518) by linear interpolation, given that f(1.300) = 0.752 15, f(1.600) = 0.746 39. Also find the value of x for which f(x) = 0.750 61.

If we wish to evaluate y = f(x) by linear interpolation given y₀ = f(x₀) and y₁ = f(x₁), we mean that we wish to evaluate y assuming that the point (x, y) lies on the straight line joining (x₀, y₀) and (x₁, y₁). It is easily verified that the required formula is

y = {(x₁ - x)y₀ + (x - x₀)y₁}/(x₁ - x₀).  (1.3)

For computational purposes it is desirable to rearrange this formula as

y = y₀ + p(y₁ - y₀),  p = (x - x₀)/(x₁ - x₀).  (1.4)

(This has been explained in Ex. 1.2. Compare (1.1), (1.3) and (1.2), (1.4) with b = (x₁ - x), a = (x₀ - x), B = y₁, A = y₀.)

In this example p = 0.218/0.300 = 0.727, to three figures. A safe procedure on a calculating machine is as follows. Set 0.752 15. Multiply by 1.000, where we use three decimal places since p is specified to three places. Clear the multiplier register. Calculate y₁ - y₀ mentally (-0.005 76). Set 0.005 76 and turn the handle once backward (since y₁ - y₀ is negative). Check that the accumulator now reads f(1.6) = 0.746 39. This checks the setting of f(1.3) and the mental calculation of y₁ - y₀, including its sign. The multiplier register now reads 1.000. Turn the handle until it shows 0.727. The number in the accumulator is f(1.518) = 0.747 96, rounded to five decimals. To answer the last part of the question, which involves "inverse linear interpolation", turn the handle till the accumulator shows the number nearest to 0.750 61. The number in the multiplier register is then 0.267. The required value of x is therefore 1.300 + 0.3 × 0.267 = 1.380, to three decimal places.

Ex. 1.4. Evaluate the following polynomials to four decimal places for x = 0(0.2)1.0:

f₁(x) = x - (1/3)x³ + (1/5)x⁵,
f₂(x) = 0.995 36x - 0.288 69x³ + 0.079 34x⁵,
f₃(x) = 0.994 95x - 0.287 06x³ + 0.078 04x⁵.
Compare the results with the corresponding values of f(x) = arctan x.

The notation x = a(p)b means "from the value x = a to the value x = b, inclusive, at intervals of p in x". Since four decimal places are required in the answer, five decimals should be carried in the working. We write

f₁(x) = (1/3)x(3 - x² + 0.6x⁴).

The values of f₁(x) can be found systematically by means of a tabulation with the following headings:

x   x²   x⁴   (1/3)x   f₁(x)

The value of (1/3)x is calculated mentally. The values of x² and x⁴ can be obtained from Barlow's Tables.† The value of f₁(x) is obtained in one desk machine operation involving two additions and two multiplications. Three numbers are recorded in addition to x and f₁(x). It is convenient to set x², x⁴ on the machine and to use 0.6 and (1/3)x as multipliers.

For f₂(x) we can proceed similarly except that it is slightly more convenient to write

f₂(x) = x(0.995 36 - 0.288 69x² + 0.079 34x⁴)

and the table heading (1/3)x is now omitted. The evaluation of f₂(x) is carried out in one machine operation involving three multiplications and one addition. Two recordings are required besides x and f₂(x). It is convenient in this case to set the five-decimal-place coefficients in the setting register, and multiply by x², x⁴.

If we had to evaluate f₂(x) for x = 0.171, say, then x², x⁴ would have to be computed on the desk machine, so that the procedure in the last paragraph would require five multiplications and one addition. In this case we should probably prefer to use nested multiplication:

a₁x + a₃x³ + a₅x⁵ = x{a₁ + u(a₃ + a₅u)},  where u = x².

This requires four multiplications and two additions. Only one recording is required besides x and f₂(x).

In hand computing, the number of decimal places and significant figures needed at various stages of the calculation have a decisive effect on the organisation of the calculation.
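Nested multiplication is easily expressed in code. The sketch below (helper name ours) evaluates the three polynomials of Ex. 1.4 in the nested form x{a₁ + u(a₃ + a₅u)} and prints their errors against arctan x in units of the fourth decimal place, the form used in Table 1.1 below:

```python
import math

def odd_poly(a1, a3, a5, x):
    """Evaluate a1*x + a3*x**3 + a5*x**5 by nested multiplication:
    x*(a1 + u*(a3 + a5*u)) with u = x*x -- four multiplications, two additions."""
    u = x * x
    return x * (a1 + u * (a3 + a5 * u))

f1 = lambda x: odd_poly(1.0, -1.0 / 3.0, 0.2, x)        # truncated Taylor series
f2 = lambda x: odd_poly(0.99536, -0.28869, 0.07934, x)  # minimax polynomial
f3 = lambda x: odd_poly(0.99495, -0.28706, 0.07804, x)  # truncated Chebyshev series

for x in (0.2, 0.4, 0.6, 0.8, 1.0):
    # errors in units of the fourth decimal place
    errors = [round((math.atan(x) - g(x)) * 1e4) for g in (f1, f2, f3)]
    print(x, errors)
```

At x = 1.0 the truncated Taylor series is wrong by over 800 units of the fourth place, while the other two polynomials stay within a few units over the whole range.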
In an automatic computer all numbers are treated alike irrespective of the number of significant figures they contain, and the method used for evaluating a polynomial will not depend on whether the coefficients or the values of x are simple.

The polynomial f₁(x) consists of the first three terms of the Taylor series expansion of arctan x. The accuracy of a truncated Taylor series decreases with increase in the distance from the point about which the function is expanded. This is illustrated by the numerical results in Table 1.1. The polynomial f₂(x) gives a much better approximation over the whole range of x, although it is not so accurate as f₁(x) near x = 0.

Table 1.1
Results for Exercise 1.4
(differences in units of the fourth decimal place)

x              0.2      0.4      0.6      0.8      1.0
f(x)           0.1974   0.3805   0.5404   0.6747   0.7854
f(x) - f₁(x)    0       -2       -32      -201     -813
f(x) - f₂(x)   +6        0       -6       +2       -6
f(x) - f₃(x)   +7       +1       -6       +1       -5

The polynomial f₃(x) is derived in Ex. 7.5 by truncating a Chebyshev series. It is almost as good a representation of arctan x over the whole range as the polynomial f₂(x), which is the optimum polynomial in the minimax sense explained in § 7.5. (See also Ex. 3.6 and the reference given there.) Polynomial representations like f₂(x), f₃(x) are particularly useful in automatic computing.

† Barlow's Tables, edited by L. J. Comrie, E. & F. Spon (London) (1952).

§ 1.6. Mistakes. The main part of this section is concerned with mistakes due to faulty computing technique when using a desk calculating machine. No matter how carefully a computation is performed there is always some risk that mistakes will occur, and it is essential to include checks throughout a calculation to ensure that mistakes will be detected as quickly as possible. Experience is the best teacher in showing the beginner where mistakes are likely to occur and how mistakes can be avoided.
It is sufficient to mention some of the more obvious sources of error: illegible figures, untidy arrangement (particularly when figures are arranged in columns), unsystematic work, trying to do too many steps at once. To some extent computing technique will depend on the individual, and the following notes represent personal preferences.

It is convenient to work in ink on paper ruled in rectangular boxes, or on squared paper with numbers inserted in suitable boxes formed from the squares. A mistake is corrected by drawing a line through a wrong number and inserting the corrected value above the wrong one. This is to be preferred to the use of pencil and rubber for various reasons. Use of ink encourages accuracy. It is useful to know when mistakes have been corrected, and exactly what corrections have been made, partly because it is easy to make further errors when correcting mistakes. Apart from anything else the work-sheet will soon look unsightly if too many mistakes are made, and this is a sign that the computer should take more care, or even start again from the beginning.

A point that is often neglected is to include sufficient explanation on the sheets: a calculation should be intelligible a year later. It is advisable to include details of all the steps in the calculation in sequence; scribbling in odd corners of the sheet to obtain intermediate results is to be discouraged. It is precisely in these intermediate results that mistakes are likely to occur.

When an answer has been obtained it may be easy to check whether it is correct, for example by substitution. On the other hand the only method of checking may be to repeat the calculation. (Straightforward repetition by the same person is not satisfactory since there may be a tendency to repeat certain kinds of mistake if a calculation is repeated in the same form by the same person.)
Between these two extremes it is not unreasonable to ask that ten or twenty per cent of the time spent on a calculation should be devoted to checks. It is often convenient to arrange a calculation so that a difference check can be employed (§ 7.3).

One has to be doubly careful when considering the correctness of results obtained by means of automatic computers, since usually an automatic machine produces answers without any detailed information about intermediate steps. It is therefore very easy to be unaware of loss of accuracy due to the particular way in which the calculations have been performed inside the machine.

It is important to try to develop a kind of "numerical common-sense" to detect obvious blunders and to keep a mental check whether the results of a calculation are turning out as one would expect. The length of time taken over a computation is of little importance compared with the necessity for obtaining a correct answer.

Examples I

Ex. 1.5. Perform each of the following calculations on a desk machine without recording intermediate results. State the accuracy of each answer and round off to the appropriate number of significant digits:

4.0963 - 53.41 - 492.0 + 63.415,
(93.05 × 4.0732) - (62.14 × 0.2154) - (676 × 0.0873),
2.4076 × 0.049 18 × 0.000 963,
(0.4362 × 0.029 432) ÷ 0.007 44.

Ex. 1.6. Find √39 - √38 to five significant figures, given √39 = 6.2450, √38 = 6.1644.

Ex. 1.7. A change of δx in x produces a change of approximately (df/dx)δx in f(x). Hence show that if in the determination of cot 0.128 the argument is correct to the number of figures stated, the cotangent is indeterminate to within 0.03. Check this result by finding from tables the values of cot 0.128 and cot 0.1285. Similarly find the degree of determinacy in the following quantities:

sin 1.560,  cos 1.560,  tan⁻¹ 1.50,  log 1.083,  exp(-2.031).

Find also the number of significant figures in each quantity and check your results from tables.
(The point of this example is to remind the reader that when y = f(x) is read from a table the number of significant figures in y is not necessarily the same as the number of significant figures in x.)

Ex. 1.8. Evaluate (cosh θ - cos θ)/(sinh θ - sin θ) to five significant figures for θ = 0.3741 by expanding appropriate terms as power series in θ.

Ex. 1.9. Distinguish between fixed- and floating-point calculations and compare the maximum errors involved in the two systems when computing (i) the sum of n numbers, (ii) the product of n numbers, (iii) the sum of n products of the form aᵢbᵢ.

CHAPTER II

ITERATIVE METHODS, WITH APPLICATIONS TO THE SOLUTION OF EQUATIONS

§ 2.1. Introduction. In this chapter we introduce an important class of numerical procedures, namely iterative methods. These are illustrated by describing typical iterative procedures for solving transcendental and algebraic equations. (Equations like cot x = 2x, x ln x = 4, containing transcendental functions, are called transcendental equations. A simple algebraic equation is x³ = 3x + 2. All of these equations can be written in the form f(x) = 0.)

The first step in an iterative solution of an equation f(x) = 0 is to locate the roots approximately. An easy method for locating any real roots is to draw a graph of the function y = f(x) and find the points where the curve cuts the x-axis. Sometimes it is convenient to write f(x) = 0 in the form f₁(x) = f₂(x) and to plot the two functions y = f₁(x) and y = f₂(x). The real roots of f(x) = 0 are then given by the abscissae of the points of intersection of the two graphs.

Various methods are available for the numerical solution of equations and it lies outside the scope of this short book to describe and compare even the more important of these. It is not implied in this chapter that the iterative methods which we discuss are to be preferred to any other methods for solving equations numerically.
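The root-location step described above, tabulating f(x) and watching for sign changes, can be sketched in a few lines (helper name ours; unit steps and simple sign changes assumed, so an exact zero at a tabulated point would need separate handling):

```python
def bracket_roots(f, a, b, step=1):
    """Tabulate f at a, a+step, ..., b and record each interval across
    which f changes sign; every such interval contains a real root."""
    brackets = []
    x = a
    while x < b:
        if f(x) * f(x + step) < 0:
            brackets.append((x, x + step))
        x += step
    return brackets

# The quadratic solved in section 2.2 below has roots in (0, 1) and (4, 5):
print(bracket_roots(lambda x: x * x - 5 * x + 2, 0, 6))
```

This is the programmed counterpart of evaluating y = f(x) at x = 0, 1, 2, ... by hand and noting where the curve crosses the x-axis.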
If an equation is complicated and results are required to only a few decimal places there may be little point in using, for instance, a rapidly convergent iterative method like the Newton-Raphson process of § 2.3. It may be much more convenient and satisfactory to use repeated plotting on successively enlarged scales (compare the last paragraph) or the method of false position (Ex. 2.12). On the other hand, when finding the complex roots of high-order polynomials it may be desirable to use sophisticated methods such as those of Graeffe or Bernoulli.

§ 2.2. A simple iterative method. Consider the equation

x² - 5x + 2 = 0  (2.1)

which has the roots α, β, say, where

α = 2.5 - √4.25 = 0.438 447,  β = 2.5 + √4.25 = 4.561 553.

In order to illustrate a simple iterative procedure we shall solve (2.1) in the following way. The first step is to localise the roots approximately. By evaluating y = x² - 5x + 2 for x = 0, 1, 2, ... we find that (2.1) has one root between 0 and 1, and one root between 4 and 5. We then rearrange (2.1) in the form x = 0.2x² + 0.4. This equation can be solved iteratively by writing

x_{r+1} = 0.2x_r² + 0.4,  r = 0, 1, 2, ...  (2.2)

where x_r denotes the rth approximation to a root. We start by choosing, say, x₀ = 0 as an approximation to one of the roots of the quadratic. If this is substituted on the right of (2.2) we obtain x₁ = 0.4. This value is then substituted on the right of (2.2) to give x₂, and so on. In this way we find

x₂ = 0.432,  x₃ = 0.437 32,  x₄ = 0.438 25,  (2.3)

where in the last two calculations we have worked to five decimal places. It is clear that, as r increases, x_r is approaching the smaller root α of the quadratic. If on the other hand we start with x₀ = 10 we find x₁ = 20.4, x₂ = 83.6, x₃ = 1398.2, and as r increases x_r increases without limit. Before considering the reasons for these results we give some definitions.
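The computation (2.2)-(2.3) can be reproduced in a few lines of code; a minimal sketch (helper name ours):

```python
def iterate(F, x0, n):
    """Return the iterates x_0, x_1, ..., x_n of x_{r+1} = F(x_r)."""
    xs = [x0]
    for _ in range(n):
        xs.append(F(xs[-1]))
    return xs

F = lambda x: 0.2 * x * x + 0.4        # the rearrangement (2.2)

print([round(x, 5) for x in iterate(F, 0.0, 4)])  # 0.0, 0.4, 0.432, 0.43732, 0.43825
print(round(iterate(F, 0.0, 25)[-1], 6))          # 0.438447, the smaller root
print(iterate(F, 10.0, 3))                        # 10.0, 20.4, 83.6..., increasing without limit
```

Continuing the iteration from x₀ = 0 settles on the smaller root α = 0.438 447, while starting from x₀ = 10 blows up, exactly as observed above.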
A calculation of the form

x_{r+1} = F(x_r),  r = 0, 1, 2, ...,  (2.4)

where we start from an assumed value for x₀ and then calculate x₁, x₂, ... in succession, as indicated in the example above, is called an iterative procedure. If as r tends to infinity x_r tends to a definite value then the procedure is said to converge. If as r tends to infinity x_r does not converge (for example if x_r tends to infinity) then the procedure is said to diverge.

We examine theoretically the convergence of the iterative procedure (2.4) by a method which will be useful in later sections. On setting x_r = x - δx_r, where x is an exact root, equation (2.4) becomes

x - δx_{r+1} = F(x - δx_r).  (2.5a)

From (2.4), if the procedure converges,

x = F(x).  (2.5b)

On expanding the right-hand side of (2.5a) in a Taylor series and using (2.5b) we find

δx_{r+1} = δx_r F′(x) - (1/2)δx_r² F″(x) + ....  (2.6a)

In this section we assume that F′(x) is non-zero. When the difference between x and x_r is small we can ignore the higher-order terms on the right of (2.6a), which then gives

δx_{r+1} ≈ δx_r F′(x).  (2.6b)

Repeated application of this formula gives δx_r ≈ {F′(x)}^r δx₀. Hence if |F′(x)| < 1 we obtain the result that δx_r tends to zero as r tends to infinity, i.e. the iterative procedure converges.

A more rigorous derivation is the following. Instead of the infinite Taylor series which led to (2.6a) we use the Taylor series with remainder (or equally the mean value theorem for derivatives)†:

F(x - δx_r) = F(x) - δx_r F′(ξ),

where ξ lies in the range x to x_r = x - δx_r. It is assumed that F′(ξ) exists in this range and that F′(ξ) is continuous in the range, including the end-points. Then (2.5a, b) give, instead of (2.6),

δx_{r+1} = δx_r F′(ξ),

where ξ lies between x and x_r. Suppose that

|F′(ξ)| ≤ M  (2.7)

for the range of ξ in which we are interested, where M is a constant independent of r. Then

|δx_{r+1}| ≤ M|δx_r| ≤ M^(r+1)|δx₀|.  (2.8)

If M < 1 then δx_r tends to zero as r tends to infinity.
† J. M. Hyslop, Real Variable, Oliver and Boyd (1960), § 6.2.

A sufficient condition for convergence of the iterative procedure (2.4) is therefore that |F′(ξ)| < 1 for ξ near the required root.

It is instructive to illustrate this result graphically, as in Fig. 2.1(a-d), where we have drawn y = x and y = F(x) for several different functions F(x). Starting from an arbitrary value x₀ of x on the line y = x we move vertically to the curve y = F(x) and then horizontally back to the line y = x. From (2.4) this gives x₁. Repetition of the procedure gives x₂, x₃, ... in succession. It is easy to see whether the procedure converges or not. From the diagrams in Fig. 2.1 it is clear that convergence to the required root can occur only in cases (a) and (b), namely when |F′(x)| < 1 and the first approximation is sufficiently near the required root.

Even if there are mistakes in the intermediate steps the above iterative procedure will give the correct final answer provided that the iteration ultimately converges to the required root (cf. Ex. 2.14). On the other hand mistakes may delay convergence or cause convergence to another root. One simple check that should be applied mentally in hand computation is that, as the calculation proceeds, the x_r tend to a limiting value monotonically, as in the case of Fig. 2.1(a), or they lie alternately on either side of a limiting value, as in the case of Fig. 2.1(b).

We now return to consideration of the iterative procedure (2.2), for which

F(x) = 0.2x² + 0.4,  F′(x) = 0.4x.

There are two roots α ≈ 0.44 and β ≈ 4.6, so that |F′(α)| < 1 and |F′(β)| > 1. Hence the above theory tells us that if we try iterating with an initial approximation near α the procedure will converge to this root, but if we start with an initial approximation near β the procedure will not converge to β. The iteration can be examined graphically as in Fig. 2.2(a). Proceeding as for Fig. 2.1,
it can be shown that if we choose an initial approximation x₀ such that |x₀| < β then the iterative procedure will converge to the smaller root α, but if |x₀| > β the procedure will diverge.

There are of course a large number of ways in which we can rearrange an equation f(x) = 0 in the form x = F(x), and each rearrangement will give an iterative procedure of type (2.4) with, in general, different convergence properties. Thus if we rearrange (2.1) in the form x = 5 - (2/x) we have the iterative formula

x_{r+1} = 5 - (2/x_r).

In this case F′(x) = 2/x², so that |F′(α)| > 1, |F′(β)| < 1, and the procedure will converge to the larger root β if we start with an initial approximation near the larger root, but it will not converge to the smaller root α if we start with an approximation near to α. The situation is illustrated in Fig. 2.2(b), from which it is clear that the procedure ultimately converges to the larger root whatever first approximation x₀ is chosen. Still another example is illustrated in Fig. 2.2(c), which shows the iterative solution of the trivial equation x = -x³. The procedure converges to the root x = 0 if we choose |x₀| < 1, but diverges if we choose |x₀| > 1.

[Fig. 2.1: diagrams illustrating the iterative procedure x_{r+1} = F(x_r), for (a) 0 ≤ F′(x) < 1, (b) -1 < F′(x) ≤ 0, (c) F′(x) > 1, (d) F′(x) < -1.]

[Fig. 2.2: examples of the iterative procedure x_{r+1} = F(x_r): (a) x_{r+1} = 0.2x_r² + 0.4, (b) x_{r+1} = 5 - (2/x_r), (c) x_{r+1} = -x_r³.]

The discussion given above has illustrated the following points in connection with iterative procedures:

(i) If we wish to obtain a specific root of an equation it may be necessary to try various iterative methods, obtained by rearranging the equation f(x) = 0 in the form x = F(x) in different ways, until we obtain a sequence which converges to the required root.

(ii) It is important from a practical point of view that convergence should be as rapid as possible.
For rapid convergence of the procedure (2.4) discussed in this section it is clear from (2.8) that |F′(x)| should be as small as possible, and x₀ should be chosen as near to the required root as possible.

(iii) In considering whether an iteration is converging or not it may be necessary to ignore the first few iterations, since the procedure may appear to diverge initially even though it ultimately converges.

(iv) In the iterative procedure considered in this section the correct final answer is obtained even though there are mistakes in the intermediate steps, provided that the iteration ultimately converges to the desired root.

The rate of convergence of iterative methods of the type discussed in this section tends to be rather slow, and in practice a device is used to accelerate the convergence, as discussed in § 2.4 below.

§ 2.3. The Newton-Raphson iterative method. Let X be an approximate root of f(x) = 0 and suppose that the exact root is x = X + h, so that h is a small quantity. Then f(X + h) = 0 and, expanding by Taylor's theorem, we have

f(X + h) = 0 = f(X) + hf′(X) + (1/2)h²f″(X) + ....  (2.9)

If h is small and f(x) is a well-behaved function in the neighbourhood of the required root, the terms in h², h³, ... can be neglected, so that

h ≈ -f(X)/f′(X).

The Newton-Raphson iterative formula is obtained by writing this result as an iterative procedure:

x_{r+1} = x_r - f(x_r)/f′(x_r),  (2.10)

where, as in § 2.2, we start with an estimate x₀ and calculate in succession x₁, x₂, ....

The geometrical interpretation of the Newton-Raphson method can be seen from Fig. 2.3(a). Suppose that the curve y = f(x) cuts the x-axis at A, so that OA = x. If x_r = OC, then CD = f(x_r) and tan ∠CBD = f′(x_r). Also BC = CD cot ∠CBD. Hence

OB = OC - BC = x_r - f(x_r)/f′(x_r),

so that OB = x_{r+1}, from (2.10). The diagram indicates that once the procedure starts to converge towards a root, it will converge from one side.
Under exceptional circumstances this may not be true, for instance if f″ = 0 or (f″/f′) < 0 at the root. Also the first few iterates may oscillate on either side of a root, as in Fig. 2.3(b).

Difficulties may arise if f′(x) ≈ 0 near a root of f(x) = 0 which it is required to determine, particularly when the equation has two nearly equal roots in this region. These difficulties can often be clarified by plotting the curve y = f(x) for values of x near the required root and examining the geometrical interpretation of the Newton-Raphson procedure. Thus in Fig. 2.3(b) if we start from the abscissa P we certainly do not converge to either of the roots at E, F, though of course the procedure converges to another root of the equation. It is usually sufficient to ensure that the x_r approach the required root of the equation from the side opposite to that on which f′(x) = 0. If f′(x) is small near the required zero of f(x) then it may be necessary to compute f(x_r) and f′(x_r) to a high degree of accuracy in order to obtain an accurate estimate of x_{r+1}. This indicates that the equation is "ill-conditioned". The meanings of the words "ill-conditioned" and "accurate" are discussed in connection with polynomials in § 2.6.

[Fig. 2.3(a, b): convergence of the Newton-Raphson iterative procedure.]

To examine the rate of convergence of the Newton-Raphson method we use the results in § 2.2, (2.4)-(2.6). To apply these to the present case we must set F(x) = x - f(x)/f′(x), so that

F′(x) = f(x)f″(x)/{f′(x)}² = 0,
F″(x) = f″(x)/f′(x) ≠ 0, in general,

where in both equations we have used the fact that f(x) = 0 at the root. Hence from (2.6a), since δx_r = x - x_r,

x - x_{r+1} = -(1/2)p(x - x_r)² + ...,  where p = f″/f′.  (2.11)

If x_r is nearly equal to the required root x, the higher-order terms can be neglected and the error in x_{r+1} is proportional to the square of the error in x_r.
This means that the rate of convergence of the Newton-Raphson procedure is very much more rapid than the rate of convergence of the simple iterative method discussed in § 2.2. This is illustrated in the following examples, where the final estimate of x is obtained by repeating the iteration until x_{r+1} and x_r agree. Care may have to be exercised when using this criterion when working to a fixed number of decimal places, as discussed in § 2.6 below. Some general aspects of rates of convergence, including estimation of the rate of convergence, are discussed in the next section.

Ex. 2.1. Find the roots of x² - 5x + 2 = 0 to five decimal places by the Newton-Raphson procedure.

If we choose f(x) = x² - 5x + 2 then (2.10) gives

x_{r+1} = x_r - (x_r² - 5x_r + 2)/(2x_r - 5) = (x_r² - 2)/(2x_r - 5).

On taking x₀ = 0 we find x₁ = 0.4 and

x₂ = 0.438 095,  x₃ = 0.438 447,  x₄ = 0.438 447.

The extremely rapid convergence is evident. (Compare results (2.3) from the procedure in § 2.2.) On starting from x₀ = 4.0 we find x₁ = 14/3 and

x₂ = 4.564 104,  x₃ = 4.561 554,  x₄ = x₅ = 4.561 553.

In this example it is convenient to work to six decimal places throughout. In calculating x₂ it would have been sufficient to have worked to three decimal places, but we did not know this beforehand. In more complicated examples it may be convenient to work to a small number of decimal places in the earlier iterations until the procedure starts to converge rapidly.

When using the Newton-Raphson procedure it may not be necessary to recompute f′(x) at each iteration. It may be sufficient to use an approximate value of f′(x). Thus in the above example we can write

x_{r+1} = x_r - (x_r² - 5x_r + 2)/(2x_r - 5)*,

where the star denotes an approximate value. On choosing x₀ = 0 we find x₁ = 0.4. To obtain a simple formula we set (2x_r - 5)* = -4 and use the iteration

x_{r+1} = x_r + (1/4)(x_r² - 5x_r + 2) = (1/4)(x_r² - x_r + 2).

With x₁ = 0.4 this gives

x₂ = 0.44,  x₃ = 0.4384,  x₄ = 0.438 449.

These figures should be compared with those obtained previously.
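The Newton-Raphson process of Ex. 2.1 is sketched below (helper name ours; the stopping rule "repeat until x_{r+1} and x_r agree" is implemented with a small tolerance):

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration (2.10): x_{r+1} = x_r - f(x_r)/f'(x_r),
    stopped when two successive iterates agree to within tol."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise ArithmeticError("iteration did not converge")

f = lambda x: x * x - 5 * x + 2
fp = lambda x: 2 * x - 5

print(round(newton_raphson(f, fp, 0.0), 6))   # 0.438447
print(round(newton_raphson(f, fp, 4.0), 6))   # 4.561553

# The same formula applied to f(x) = x^2 - a gives square roots (Ex. 2.2 below):
print(round(newton_raphson(lambda x: x * x - 11, lambda x: 2 * x, 3.0), 5))
```

Starting from 0.0 and 4.0 the procedure reproduces the two roots found above in a handful of iterations, illustrating the second-order convergence.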
Although the rate of convergence is of course not so rapid as in the true second-order procedure, it is still remarkably good considering the crudeness of the approximation used for 2x_r - 5. Although we have approximated to f′(x), it is essential to calculate f(x) to full accuracy. It often happens in iterative processes that we can be relatively crude in certain parts of the calculation, but accuracy must be maintained in certain vital parts of the calculation.

Ex. 2.2. Find √11 to five decimal places by an iterative procedure.

We wish to solve the equation x² = a. On writing f(x) = x² - a the Newton-Raphson formula gives

x_{r+1} = x_r - (1/2)(x_r - a/x_r) = (1/2)(x_r + a/x_r),

where the first form is convenient for hand computation since x_r - (a/x_r) is a small quantity. If we choose x₀ = 3 then

x₁ = 3.33,  x₂ = 3.316 65,  x₃ = 3.316 62,

where x₃ is correct to five decimals. This is a very convenient method for finding square roots on a desk calculator.

Ex. 2.3. Solve x = exp(1/x) to five decimal places.

Various formulae can be obtained for the iteration by rearranging the equation in different ways. If we choose f(x) = x - exp(1/x) the Newton-Raphson formula gives

x_{r+1} = x_r - x_r²{x_r - exp(1/x_r)}/{x_r² + exp(1/x_r)}.  (i)

If we write x = 1/y, so that f(y) = 1 - y exp y, we find

y_{r+1} = y_r + {exp(-y_r) - y_r}/(1 + y_r).  (ii)

If we take logarithms and write f(x) = 1 - x ln x, then

x_{r+1} = x_r + (1 - x_r ln x_r)/(1 + ln x_r).  (iii)

Forms (ii) and (iii) are more convenient than (i). It is found graphically that there is only one root, near x = 1.8 (or y = 0.6). With initial approximation y₀ = 0.6, formula (ii) gives successively 0.568, 0.567 144, i.e. x = 1.763 22. With initial approximation x₀ = 1.8, formula (iii) gives successively 1.7634, 1.763 22. The final iteration should of course be repeated as a check.

§ 2.4. General aspects of iterative procedures.
If an iterative procedure for finding a quantity x by calculating successively x₁, x₂, ... is such that

x - x_{r+1} = a_k(x - x_r)^k + a_{k+1}(x - x_r)^(k+1) + ...  (2.12)

where a_k is a non-zero constant, we say that the iterative process is of the kth order. We see from (2.6b) that the simple procedure in § 2.2 gives in general a first-order process, and from (2.11) that the Newton-Raphson procedure is second-order.

The order of the iterative procedure x_{r+1} = F(x_r) for finding a root x can be obtained by expanding F(x_r) in a Taylor series around x, as in (2.6a). If the first non-zero derivative of F(x) at the root is of the kth order then the iterative procedure is also of the kth order, since (2.12) and (2.6a) then agree, with k!a_k = (-1)^(k+1)F^(k)(x). If F(x) has a simple form it may be easy to determine the order of an iterative procedure by finding the order of the first non-vanishing derivative, as in the following example. If F(x) is complicated it may be simpler to obtain the expansion (2.12) directly, as in the argument leading to (2.17) below.

Ex. 2.4. Show that x_{r+1} = x_r(3 - 3ax_r + a²x_r²) is a third-order process for the calculation of 1/a.

We have

F(x) = 3x - 3ax² + a²x³,
F′(x) = 3(ax - 1)²,  F″(x) = 6a(ax - 1),  F‴(x) = 6a².

The equation x = F(x) has three roots, namely x = 0, 1/a, 2/a. If x = 0 the process is first-order with F′(x) = 3 > 1, and therefore the iteration will never converge to x = 0. Similarly if x = 2/a the process is first-order with F′(x) = 3 > 1 and the iteration will not converge to x = 2/a. The remaining possibility is x = 1/a, and then F′(x) = F″(x) = 0, F‴(x) ≠ 0, so that we have a third-order process for the calculation of 1/a.

We shall use the following notation for the approximate relations between the errors in x_r and x_{r+1} for first- and second-order procedures, respectively:

x - x_{r+1} = M(x - x_r),  (2.13)
x - x_{r+1} = N(x - x_r)².  (2.14)
A second-order process is extremely rapidly convergent once the error becomes small enough. If the error at any stage of an iteration is δ, the errors in successive iterates in a second-order process are Nδ², N³δ⁴, N⁷δ⁸, ..., N⁻¹(Nδ)^(2^r), .... Hence the procedure is convergent if |Nδ| < 1, and the smaller |Nδ| is, the more rapid the convergence. The term |Nδ|^(2^r) ultimately goes to zero extremely rapidly as r increases, for any value of |Nδ| less than unity. If Nδ = 0.1, then the sequence (Nδ)², (Nδ)⁴, (Nδ)⁸, ... gives 0.01, 0.0001, 0.000 000 01, .... If Nδ = 0.9 we have, approximately, the sequence 0.81, 0.656, 0.430, 0.185, 0.0343, 0.001 18, 0.000 001 4, .... The convergence is initially slow but ultimately rapid. Once the state of rapid convergence is reached, each iteration will roughly double the number of correct figures in the answer, but it is misleading to state without qualification that second-order processes double the number of significant figures per iteration. For instance the iteration may be started with a value of x₀ so far from the required root that the second-order analysis may not apply to the first few iterations. In this case the initial convergence may be very slow. In fact one of the problems in applying second-order processes is to arrange that the state of rapid convergence is reached after as few iterations as possible (compare Ex. 3.1, Program 3.5).

In first-order processes the behaviour of the error is completely different from that just discussed in connection with second-order processes. If the initial error is δ, the errors in a first-order process are successively Mδ, M²δ, M³δ, ..., M^r δ, .... The error is reduced by a constant factor M in each iteration. This means that there is no ultimate extremely rapid convergence as in second-order processes.

Because of the comparatively slow convergence of first-order processes it is often desirable to improve the rate of convergence by the following important method.
From (2.13) we assume that, for the first-order procedure under consideration,

X − x_r = M(X − x_{r−1}),   X − x_{r−1} = M(X − x_{r−2}),

where X is an approximation to the root. On eliminating M and rearranging we obtain an expression for X in terms of x_r, x_{r−1}, x_{r−2} which will be, in general, a better approximation to the root than any of the iterates. We find

X = (x_r x_{r−2} − x_{r−1}²)/(x_r − 2x_{r−1} + x_{r−2})
  = x_r − (x_r − x_{r−1})²/(x_r − 2x_{r−1} + x_{r−2}),   (2.15)

where the second form is more convenient for calculation (compare Ex. 1.2). The use of (2.15) is usually called Aitken's δ²-process, since A. C. Aitken popularised its applications in numerical analysis. It goes back at least as far as Kummer in 1837.

As an example consider the figures (2.3):

x₂ = 0.432,  x₃ = 0.437 32,  x₄ = 0.438 25.   (2.16a)

Substitution in (2.15) gives

X = 0.438 25 − (0.000 93)²/(−0.004 39) = 0.438 45.   (2.16b)

If we use this value as a new x₀ in (2.2) and start a new sequence of iterations, we find that the first iterate repeats the value 0.438 45, so that the iteration converges to this estimate of the root.

(It is convenient at this point to anticipate notation introduced in § 7.2 in order to explain the name "δ²-process". Suppose that the numbers (2.16a) are arranged in a difference table:

x_{r−2}   0.432 00
                        532
x_{r−1}   0.437 32             −439
                         93
x_r       0.438 25

The quantity x_r − x_{r−1} (= 93) is the first difference δx_{r−½}, and x_r − 2x_{r−1} + x_{r−2} (= −439) is the second difference δ²x_{r−1}, in units of the fifth decimal place. These numbers occur in (2.16b) above. The name "δ²-process" comes from this connection with differences.)

This type of procedure, namely using several successive iterates to obtain an improved estimate of the root, which is then used as an initial approximation for a further sequence of iterations, is called accelerating the convergence.
It must be remembered that in the derivation of formulae for accelerating convergence we assume that round-off errors are negligible. The formulae will not be valid if round-off errors are comparable with the differences between successive iterates. Although we stated at the end of § 2.2 that mistakes are usually not very important in iterative procedures, this is obviously no longer true when accelerating procedures are used.

An accelerating procedure may make convergence worse if the assumptions on which the procedure is based are not satisfied. Thus Aitken's δ²-process is based on (2.13), which assumes a first-order procedure. Consider the following figures from Ex. 2.1 illustrating the Newton-Raphson second-order procedure:

x₁ = 4.666 667,  x₂ = 4.564 104,  x₃ = 4.561 554.

The δ²-process indicates that x₃ is in error by 0.000 065, whereas the error is in fact only 0.000 001.

To examine the convergence of Aitken's process, suppose that a first-order procedure gives three successive iterates x₀, x₁, x₂ which are used to obtain a new initial value x₀⁽¹⁾ by means of (2.15). From x₀⁽¹⁾ we deduce x₁⁽¹⁾, x₂⁽¹⁾ by the first-order procedure and then use (2.15) to obtain x₀⁽²⁾, and so on. From the definition of a first-order procedure we have

x − x₁ = a₁(x − x₀) + a₂(x − x₀)² + ...,
x − x₂ = a₁(x − x₁) + a₂(x − x₁)² + ... = a₁²(x − x₀) + a₁a₂(1 + a₁)(x − x₀)² + ...,

where a₁, a₂ are constants. Hence

x₂ − 2x₁ + x₀ = −(1 − a₁)²(x − x₀) + a₂(2 − a₁ − a₁²)(x − x₀)² + ...,
(x₂ − x₁)² = a₁²(1 − a₁)²(x − x₀)² + 2a₁a₂(1 − 2a₁ + a₁³)(x − x₀)³ + ....

On substituting these results in (2.15) with r = 2 we find

x − x₀⁽¹⁾ = C(x − x₀)²,   (2.17)

where C is a constant. Similarly, in general, x − x₀⁽ˢ⁺¹⁾ = C(x − x₀⁽ˢ⁾)², so that Aitken's δ²-process for accelerating convergence is second-order.
The question of whether to use a first-order process with acceleration of convergence, or the Newton-Raphson second-order process, depends on the particular equation considered. As a broad generalisation, if f'(x) has a simple form then the Newton-Raphson procedure will be preferable, but in complicated examples the accelerated first-order procedure may be simpler. Second-order processes are so quickly convergent that it is seldom necessary to consider third- or higher-order processes, although procedures of arbitrarily high order can be constructed for any given iterative problem.

When using a first-order iterative procedure it is often useful to estimate the quantity M in (2.13) after each iteration. By subtracting the equations

x − x_r = M(x − x_{r−1}),   x − x_{r−1} = M(x − x_{r−2}),

we find

M = (x_r − x_{r−1})/(x_{r−1} − x_{r−2}) = 1/t,   (2.18a)

say, where t, the reciprocal of M, has been introduced since it is usually more convenient to work in terms of t rather than M. By inspection we see that equation (2.15) for the δ²-process can be written in terms of t as

X = x_r + (x_r − x_{r−1})/(t − 1).   (2.18b)

As an example the figures in (2.16a) give

t = 532/93 ≈ 6,   X = 0.438 25 + ⅕(0.000 93) = 0.438 44.

From (2.6) we see that, when solving the equation x = F(x), the theoretical value of M is F'(x). In this example the theoretical value of t is therefore t = 1/M = 1/(0.4x) ≈ 6. However in practice it is useful to compute the empirical value of t after each iteration, since if the successive estimates of t tend to a constant value this confirms that t-extrapolation will be valid.

In a similar way, when using a second-order procedure it is useful to estimate N in (2.14) after each iteration. Since a second-order procedure converges very rapidly, we estimate N by replacing x by x_r in the equation

x − x_{r−1} = N(x − x_{r−2})².

This gives

N ≈ (x_r − x_{r−1})/(x_r − x_{r−2})².   (2.18c)
Similarly, to estimate x we assume that we can replace x by x_r on the right of the equation

x − x_r = N(x − x_{r−1})².

This gives

x ≈ x_r + N(x_r − x_{r−1})²,   (2.18d)

where it will be assumed that the value of N is estimated by means of (2.18c). In terms of the quantity t defined in (2.18a) this formula becomes

x ≈ x_r + (x_r − x_{r−1})/(1 + t)²,   (2.18e)

though now t will not tend to a constant value independent of r as r increases. As a numerical example consider the following figures from Ex. 2.1:

x₁ = 4.666 667,  x₂ = 4.564 104,  x₃ = 4.561 554.

Then

N ≈ 0.0025/(0.1)² = ¼,   x − x₃ ≈ ¼(0.0025)² = 0.000 001 6,

which indicates that x₃ should be correct to within about two units in the sixth decimal place. Alternatively we calculate

t = 0.1/0.0025 = 40,   x − x₃ ≈ 0.0025/41² = 0.000 001 6,

as before. The above formulae should not be applied until the iterative procedure has started to converge systematically.

§ 2.5. Real roots of polynomials. The general polynomial equation of degree n is (assuming a₀ ≠ 0)

P_n(x) = a₀xⁿ + a₁xⁿ⁻¹ + ... + a_n = 0.   (2.19)

The following properties of polynomial equations should be borne in mind†:

(i) A polynomial equation of degree n has exactly n roots. (A factor (x − α)^m gives a repeated root which must be counted m times.)

(ii) If the equation has real coefficients, complex roots occur in pairs: if p + iq is a root then p − iq is a root. An equation of odd degree has at least one real root.

(iii) If the roots are x₁, x₂, ..., x_n then

x₁ + x₂ + ... + x_n = −(a₁/a₀),
x₁x₂...x_n = (−1)ⁿ(a_n/a₀),
Σ x₁x₂...x_r = (−1)^r(a_r/a₀),  r = 2 to n − 1,

where the sum is taken over all possible products of r distinct x_i chosen from the n roots.

(iv) If one root is much smaller than the others then it is given approximately by a_{n−1}x + a_n = 0. Similarly the two smallest roots are often given

† H. W. Turnbull, Theory of Equations, 5th Edition (Oliver and Boyd, 1957).
approximately by the roots of a_{n−2}x² + a_{n−1}x + a_n = 0. If one root is much larger than the others it is given approximately by a₀x + a₁ = 0. The two largest roots are often given approximately by a₀x² + a₁x + a₂ = 0.

The Newton-Raphson formula can be applied to the polynomial equation (2.19). If u is an approximate root of this equation then an improved root is given by

u − {P_n(u)/P_n′(u)}.   (2.20)

It is desirable to systematise the calculation of P_n(u)/P_n′(u). We consider only real u. Suppose that division of P_n(x) by x − u gives

P_n(x) = (x − u)Q_{n−1}(x) + B.

Then P_n(u) = B. Also

P_n′(x) = Q_{n−1}(x) + (x − u)Q′_{n−1}(x).

Suppose that division of Q_{n−1}(x) by x − u gives

Q_{n−1}(x) = (x − u)R_{n−2}(x) + D.

Then P_n′(u) = Q_{n−1}(u) = D. From (2.20) the improved value of u is therefore given by u − B/D.

The calculation of B by long division, which also gives the coefficients of Q_{n−1}(x), proceeds with b₀ = a₀, b₁ = a₁ + b₀u, b₂ = a₂ + b₁u, etc. The evaluation of B is essentially the evaluation of the polynomial P_n(u) by nested multiplication,

P_n(u) = [{(a₀u + a₁)u + a₂}u + a₃]u + ...,

mentioned in Ex. 1.4. It is convenient to tabulate the calculation in the form given below, where b_n = B = P_n(u). We include also the calculation required for P_n′(u) = D, given by c_{n−1} in the table:

a₀    a₁     a₂     a₃    ...   a_{n−1}    a_n
      b₀u    b₁u    b₂u
b₀    b₁     b₂     b₃    ...   b_{n−1}    b_n      (2.21)
      c₀u    c₁u    c₂u
c₀    c₁     c₂     c₃    ...   c_{n−1}

The sequence is to calculate in succession b₀, b₀u, b₁, b₁u, b₂, etc. When the student has gained some practice it will be possible to form b_r = a_r + b_{r−1}u in one machine operation without recording b_{r−1}u, and similarly for c_r, so that the second and fourth rows can be omitted in the above table.
If several roots are required, they should be determined in order of increasing absolute value, and it is usually satisfactory (and desirable) to remove them from the polynomial as they are determined. (See Ex. 3.3, where this procedure is discussed in connection with automatic computing.) The polynomial obtained by dividing P_n(x) by x − u is

P_{n−1}(x) = b₀xⁿ⁻¹ + b₁xⁿ⁻² + ... + b_{n−1},

so that the coefficients of P_{n−1}(x) have been determined in the course of the Newton-Raphson procedure for finding the roots of P_n(x).

Ex. 2.5. Find the real root of the following equation to four decimal places:

x⁵ + 2.653x⁴ + 4.512x³ − 2.043x² − 0.263x − 0.251 = 0.

A graph shows that the real root is near 0.5 and we take this as x₀. The procedure leading to the table (2.21), without the second and fourth rows, gives, on using five decimal places throughout for convenience:

1   2.653 00   4.512 00   −2.043 00   −0.263 00   −0.251 00   (0.5
1   3.153 00   6.088 50    1.001 25    0.237 62   −0.132 19
1   3.653 00   7.915 00    4.958 75    2.717 00

1   2.653 00   4.512 00   −2.043 00   −0.263 00   −0.251 00   (0.549
1   3.202 00   6.269 90    1.399 18    0.505 15    0.026 33
1   3.751 00   8.329 20    5.971 91    3.783 73

The first correction is δ₁ = 0.132 19/2.717 00 = 0.049, giving x₁ = 0.549. The second correction is δ₂ = −0.026 33/3.783 73 = −0.006 96, giving x₂ = 0.542 04. A further iteration gives a final value of 0.5419, correct to four decimal places.

§ 2.6. Errors when finding roots of polynomials. When we wish to specify the accuracy of an approximation X to a root x of an equation f(x) = 0 there are at least three different ways in which the word "accuracy" can be interpreted:

(i) Assuming that f(x) is specified exactly, then |X − x| is the absolute accuracy of X and |X − x|/|x| is the relative accuracy of X. We shall normally use the word "accuracy" in one of these senses.
(ii) If we do not know x, then the smallness of the quantity f(X) = R, say, gives some indication of the accuracy of X. The quantity R is called the residual. It is most important to realise that small R does not necessarily mean that the absolute or relative error of X is small. This is discussed below.

(iii) If the constants defining the equation f(x) = 0 (for example the coefficients, if f(x) is a polynomial) are not known exactly, then the uncertainty in these constants means that there is an uncertainty in the values of the roots of f(x) = 0. The accuracy of an approximate root can be no greater than the uncertainty produced in the roots by possible variations in the constants defining the equation.

In some cases, with reference to (ii), small values of R can be produced by values of X appreciably different from the exact roots; and with reference to (iii), small variations in constants can produce large variations in the roots. These phenomena often appear together. In either case the equations are said to be ill-conditioned. As an example consider the equation

f(x) = x² − 1.064x + 0.283 = 0,   (2.22)

which has roots 0.527, 0.537 to three decimal places. The equation x² − 1.0641x + 0.283 = 0 has roots 0.523, 0.541; changing a coefficient by 1 in 10,000 has changed the roots by 1 in 130. If we substitute x = 0.532 in (2.22) then f(x) = R = −0.000 024; changing x by 1 in 100 produces a residual which is only one ten-thousandth of the smallest coefficient. It should be emphasised that these phenomena are inherent in the equation; they appear whatever method is used to solve the equation.

Since in the above example a change of 0.001 in a root produces a change of only 0.000 005 in R, we need to work to six decimal places in order to verify that a root is accurate to three decimals. To examine the reason for this more closely, we note that in the Newton-Raphson method the correction to the root is −f(x)/f′(x).
In the above example, although f(x) = R is small, f′(x) is also small. For example, if we choose x = 0.525,

f(x) = (0.525)² − 1.064(0.525) + 0.283 = 0.000 025,
f′(x) = 2(0.525) − 1.064 = −0.014.

We encounter the familiar difficulty that in computing f(x) and f′(x) the cancellation of large numbers gives small numbers, and in order to obtain a moderate degree of accuracy for their quotient we need to work to a large number of decimal places. This is not a criticism of Newton's method. The difficulty will occur in one form or another, whatever method is used to solve the equation. To illustrate by a numerical example, consider the formula obtained by applying the Newton-Raphson formula (2.10) to (2.22):

x_{r+1} = (x_r² − 0.283)/(2x_r − 1.064).   (2.23)

Iteration with x₀ = 0.55, working to four decimal places, gives

x₁ = 0.0195/0.0360 = 0.5417,   x₂ = 0.0104/0.0194 = 0.5361,

and then in succession 0.5366, 0.5326, 0.5833, 0.5575, 0.5451, 0.5382, 0.5403, 0.5361, .... This last number, which is x₁₀, is in fact identical with x₂, and further iteration will merely repeat the values of x₃ to x₁₀. Even though we have used four decimal places, all we can say about the root is that it is near 0.54. Equation (2.23) shows clearly the way in which significant figures are lost in intermediate calculations. Similarly, if five decimals are used in the calculation it will be found that the iterates vary between 0.536 50 and 0.537 50 in general, and we can say that the root is approximately 0.537.

There is no guarantee that the correct answer lies between the upper and lower limits within which the iterates fluctuate, or even, if an iteration apparently converges, that it converges to the correct answer. Thus in the above example if we iterate with x₀ = 0.50 using four decimal places, we find in succession 0.5156, 0.5244, 0.5263, and all further iterations repeat 0.5263; but we know that the correct answer in this case is 0.5270.
To sum up this discussion: in order to calculate any root to a specified degree of accuracy, an essential and unavoidable degree of accuracy is necessary in the intermediate calculations, and this may be much greater than the final accuracy of the root.

The change δx in a root x produced by a change δa_r in a coefficient a_r can be examined theoretically as follows. The quantity x + δx is a root of the perturbed equation, so that

P_n(x + δx) + δa_r(x + δx)ⁿ⁻ʳ = 0.

Hence to first order, since P_n(x) = 0,

P_n′(x)δx + δa_r xⁿ⁻ʳ = 0,

or

δx = −δa_r{xⁿ⁻ʳ/P_n′(x)}.   (2.24)

When {xⁿ⁻ʳ/P_n′(x)} is large for any root, then by our previous definition this root is ill-conditioned. Of course the linear approximation on the basis of which (2.24) is derived breaks down for ill-conditioned roots. Equation (2.24) is not quantitatively accurate for the polynomial (2.22), for example, but the qualitative indication of ill-conditioning is correct. Formula (2.24) indicates that the smaller roots tend to be better conditioned than the larger roots since, if x_i is the ith root, the power x_iⁿ⁻ʳ usually increases more rapidly with i than P_n′(x_i).

Some instructive illustrations of the application of (2.24) are given by J. H. Wilkinson†, from whose paper the following examples are quoted. In this paragraph we assume that x₁, x₂, ..., x_n represent the n roots of a polynomial of degree n. (Elsewhere in this chapter x_r denotes the rth approximation to a root in an iterative procedure.) If we consider the polynomial equation

P₂₀(x) = (x + 1)(x + 2)...(x + 20) = 0,   (2.25)

with roots x_s = −s (s = 1 to 20), we find for a change δa₁ in a₁:

δx₂₀ ≈ 0.4 × 10⁸ δa₁,   δx₁₅ ≈ 0.2 × 10¹⁰ δa₁,   δx₅ ≈ 0.6 δa₁.

The roots x₁ to x₅ are well-conditioned, but the conditioning is extremely poor for the large roots.
If we consider the polynomial equation

(x + 1)(x + 2)...(x + 20) + 2⁻²³x¹⁹ = 0,

which differs from (2.25) by about 10⁻⁷ in a₁, the roots are x_s = −s to nine decimal places for s = 1 to 4, but x₁₀, x₁₁ are −10.10 ± 0.64i, x₁₅, x₁₆ are −13.99 ± 2.52i, and x₁₉, x₂₀ are −19.50 ± 1.94i. Among other things, this example illustrates in a remarkable way that some roots can be ill-conditioned and others well-conditioned in the same equation.

Although it is true that ill-conditioning is usually associated with the occurrence of nearly equal roots, which in turn implies that f′(x) is small, this remark should be interpreted with care. Thus the equation x²⁰ + 1 = 0, which has 20 roots, all of modulus unity, is comparatively well-conditioned. This can be confirmed by using (2.24).

† The evaluation of the zeros of ill-conditioned polynomials, Numerische Mathematik, 1 (1959), 150–180. This excellent paper will repay careful study.

§ 2.7. Bairstow's method for finding complex roots of polynomials. In this section we extend the Newton-Raphson method for real roots (§ 2.5) to give formulae convenient for dealing with complex roots. Suppose that w is an approximate complex root of P_n(x) = 0, where the polynomial has real coefficients. Consider the quadratic factor (x − w)(x − w̄) = x² − px − q, say, where w̄ is the complex conjugate of w, so that p, q are real. We write (cf. (2.19), (2.20))

P_n(x) = (x² − px − q)S_{n−2}(x) + Ax + B,   (2.26)
S_{n−2}(x) = (x² − px − q)T_{n−4}(x) + Cx + D.   (2.27)

Hence P_n(w) = Aw + B. Also

P_n′(x) = (2x − p)S_{n−2}(x) + (x² − px − q)S′_{n−2}(x) + A.

Since p = w + w̄ this gives

P_n′(w) = (w − w̄)(Cw + D) + A.

From the Newton-Raphson formula the correction δw to the root w is

δw = −(Aw + B)/{(w − w̄)(Cw + D) + A}.

In general δw is small, so that the numerator of this expression is much less than the denominator, and this implies that the term A in the denominator can be neglected.
Hence we can write

δw = −(Aw + B)/{(w − w̄)(Cw + D)}.   (2.28)

When we examine the division of P_n(x) by a quadratic factor x² − px − q in detail, it is found convenient to modify the above formulae in the following way. Suppose that in (2.26)

P_n(x) = a₀xⁿ + a₁xⁿ⁻¹ + ... + a_n,
S_{n−2}(x) = b₀xⁿ⁻² + b₁xⁿ⁻³ + ... + b_{n−2}.

Then from (2.26), b₀ = a₀ and

b₁ = a₁ + pb₀,   b_r = a_r + pb_{r−1} + qb_{r−2}  (2 ≤ r ≤ n−2),   (2.29)
A = a_{n−1} + pb_{n−2} + qb_{n−3},   B = a_n + qb_{n−2}.

If we extend (2.29) to hold for r = n−1 and n, thus defining b_{n−1} and b_n, we see that

A = b_{n−1},   B = b_n − pb_{n−1}.

Similarly suppose that when we divide x² − px − q into S_{n−2}(x) to give (2.27) the coefficients of T_{n−4}(x) are c₀, c₁, ..., c_{n−4}, and we define c_{n−3}, c_{n−2} by the same rule as for c_r (r = 2 to n−4). Then we find

C = c_{n−3},   D = c_{n−2} − pc_{n−3}.

On using these results in (2.28), remembering that w + w̄ = p, we obtain

δw = −(b_{n−1}w̄ − b_n)/{(w − w̄)(c_{n−3}w̄ − c_{n−2})}.   (2.30)

The computational scheme for the b_r and c_r is as follows (cf. (2.21)):

a₀    a₁     a₂     a₃    ...   a_{n−2}    a_{n−1}    a_n
      pb₀    pb₁    pb₂   ...   pb_{n−3}   pb_{n−2}   pb_{n−1}
             qb₀    qb₁   ...   qb_{n−4}   qb_{n−3}   qb_{n−2}
b₀    b₁     b₂     b₃    ...   b_{n−2}    b_{n−1}    b_n      (2.31)
      pc₀    pc₁    pc₂
             qc₀    qc₁
c₀    c₁     c₂     c₃    ...   c_{n−2}

Instead of working in terms of the correction δw to the root, it is convenient to work in terms of δp, δq, corrections to the quadratic factor. We have

(x − w − δw)(x − w̄ − δw̄) = x² − (p + δp)x − (q + δq),

so that

δp = δw + δw̄,   δq = −w̄δw − wδw̄ − δwδw̄.

On neglecting the second-order term in the last equation we find, on substituting from (2.30), that

δp = (b_n c_{n−3} − b_{n−1}c_{n−2})/Δ,   (2.32a)
δq = −(b_n c_{n−2} − b_{n−1}E)/Δ,   (2.32b)

where

Δ = c²_{n−2} − c_{n−3}E,   (2.32c)
E = pc_{n−2} + qc_{n−3}.   (2.32d)

These are Bairstow's formulae. They are equivalent to the following simultaneous linear equations for δp, δq:

c_{n−2}δp + c_{n−3}δq = −b_{n−1},   (2.33a)
Eδp + c_{n−2}δq = −b_n.   (2.33b)
The derivation given here has shown the very close relationship between Bairstow's formulae and the Newton-Raphson method. Bairstow's method depends on locating a quadratic factor approximately before iterating to obtain the accurate factor. It should be noted that methods such as root-squaring (Graeffe's process) and the Aitken-Bernoulli process are available for calculating complex roots of polynomials even though no initial approximations are known†.

Ex. 2.6. Find the complex roots of smallest modulus for the polynomial equation in Ex. 2.5, to four decimal places.

As an approximation to the required quadratic factor we take the last three terms of the polynomial, which give (x² + 0.1287x + 0.1229). The procedure in the table (2.31), omitting the second, third, fifth, and sixth rows, which do not need to be recorded, gives

1.000 00   2.653 00   4.512 00   −2.043 00   −0.263 00   −0.251 00
1.000 00   2.524 30   4.064 22   −2.876 30   −0.392 31    0.152 99
1.000 00   2.395 60   3.633 01   −3.638 29

From (2.32) we find

E = 0.021 75,   Δ = 13.158 14,   δp = −0.066 23,   δq = 0.041 65.

On computing the new p, q and repeating the procedure we find

p            q            b_{n−1}      b_n
−0.194 93   −0.081 25     0.003 26   −0.006 83
−0.195 74   −0.083 12    −0.000 01   +0.000 03

Further iteration indicates that p, q are accurate to one unit in the fifth decimal place, and solution of the corresponding quadratic gives the complex roots −0.0979 ± 0.2712i, to four decimal places.

† See, for instance, Modern Computing Methods, H.M. Stationery Office, 2nd Edn. (1961).

EXAMPLES II

Ex. 2.7. Find the real roots of the following equations to four significant figures:

(i) x = cos x,  (ii) x² − 1 = sin x,  (iii) xˣ = 5.

Ex. 2.8. The equation x³ + 1.5x − 1.5 = 0 can be written in the forms

(i) x = 1 − ⅔x³,  (ii) x = 1.5(1.5 + x²)⁻¹,  (iii) x = (1.5 − 1.5x)^⅓.

On applying the method of § 2.2, find which of the corresponding iterative procedures are convergent. Hence determine the real root of the equation to four decimal places.

Ex. 2.9.
If the root of the equation cot x = kx between 0 and ½π is small, show that it is given approximately by x ≈ (k + ⅓)^(−½), where k is large. If the root is near ½π, show that it is given approximately by x ≈ ½π(1 + k)⁻¹, where k is small. Use these results as first approximations to calculate the roots of the following equations to the appropriate number of significant figures, assuming that the coefficient of x in each case is correct to the last digit given:

(i) cot x = 111.1x,  (ii) cot x = 1.111x,  (iii) cot x = 0.011 11x.

Ex. 2.10. Use the Newton-Raphson method to obtain the following iterative formulae for x = a^(1/n) (a > 0):

f(x) = xⁿ − a,   x_{r+1} = n⁻¹{(n − 1)x_r + ax_r^(1−n)},
f(x) = 1 − (a/xⁿ),   x_{r+1} = n⁻¹{(n + 1)x_r − a⁻¹x_r^(n+1)}.

If x = x_r + δx_r, show that in each case δx_{r+1} = p_r δx_r² for suitable p_r. Deduce that if n > 1 or n < −1 the two sequences approach x from opposite sides. Illustrate these results by calculating π^(1/3) and π^(−1/3) by the two sequences, as accurately as possible, given π = 3.141 593.

Ex. 2.11. Show that

x_{r+1} = (x_r³ + 3ax_r)/(3x_r² + a)

is a third-order procedure for the calculation of √a. Illustrate this result by calculating √11 to nine decimal places, using x₀ = 3. (Compare Ex. 2.2.)

Ex. 2.12. Let y₁ = f(x₁), y₂ = f(x₂). Show that the chord joining (x₁, y₁) to (x₂, y₂) cuts the x-axis at

x₃ = (x₁y₂ − x₂y₁)/(y₂ − y₁).

This is the basis of the well-known method of false position† for finding the roots of f(x) = 0: the value of y₃ = f(x₃) is computed and the procedure is repeated using (x₃, y₃) instead of either (x₁, y₁) or (x₂, y₂). It is convenient to choose the two values of x at each stage so that they lie on opposite sides of the root. The method is essentially inverse linear interpolation. Use this method to solve the equations in Exs. 2.1, 2.3.

Ex. 2.13. A method due to J. H. Wegstein‡, which improves the convergence of the iterative procedure of § 2.2

† E. T. Whittaker and G. Robinson, Calculus of Observations, Blackie (1944), p. 92.
‡ Comm.
Assoc. Comput. Mach., 1 (1958), No. 6, p. 9. See also G. N. Lance, Numerical Methods for High Speed Computers, Iliffe (1960), pp. 134–138.

for solving the equation x = F(x), is given by

X_{n+1} = (x_{n+1}X_{n−1} − x_n X_n)/(x_{n+1} − X_n − x_n + X_{n−1})  (n = 1, 2, ...),

where x_{n+1} = F(X_n), and X₀, X₁ are arbitrary. Show that this procedure is simply a systematic version of the method of false position described in Ex. 2.12.

Ex. 2.14. Consider the following sequential calculation. If we define

a₀ = 1,   b₀ = (1 + x²)^½,
a_{r+1} = ½(a_r + b_r),   b_{r+1} = (a_{r+1}b_r)^½,   r = 0, 1, 2, ...,

then it can be shown that, as r tends to infinity,

lim a_r = lim b_r = x/arctan x.

This is not an iterative method of the type considered in this chapter, since the limit depends on the starting values. In particular it is vital to avoid mistakes in the above calculation.

Ex. 2.15. Let ±x₁, ±x̄₁, ±x₂, ±x̄₂ denote the roots of

x⁸ − 2(2p² − 0.3)x⁶ + (¼Q⁴ + 6p⁴ − 6p² + 3.73)x⁴ − 2p²(2p⁴ − 3.7p² + 1.7)x² + p⁴(p² − 1)² = 0.

Calculations on an automatic computer gave the following results (Q = 100):

for p = 4,   x₁ = 50.159 + 49.842i,   x₂ = 0.155 39 + 0.154 45i,
for p = 10,  x₁ = 51.017 + 49.022i,   x₂ = 1.0139 + 0.9745i.

Verify that for the smaller roots the formula

x₂ ≈ ...

gives results correct to 1 in 15,000 for p = 4 and 1 in 5000 for p = 10. Derive a formula accurate to 1 in 10,000 for the larger roots. (The moral of this example is that even though numerical results can be obtained by an automatic computer, this should not blind one to the possibility of obtaining an adequate approximate answer in analytical form.)

CHAPTER III

ELEMENTARY PROGRAMMING FOR AUTOMATIC COMPUTERS

§ 3.1. Introduction. For the purposes of this book it is unnecessary to go into detail regarding the structure of electronic digital computers; it will suffice to make some general remarks about the way in which an automatic computer works.
An automatic computer has to be told precisely, and in detail, the steps which have to be performed to solve any given problem. From this point of view a calculation consists of two parts: a series of instructions specifying the method to be used, and a sequence of numbers leading from the data of the problem through various steps in the computation to the answer. It will be assumed that instructions and data are punched on cards which have to be fed into the machine, and that the results of a calculation are printed out by the machine on an output sheet of paper. For our purposes a digital computer can be considered to have the following components:

(a) A memory or store in which numbers and instructions can be stored, and from which any required number can be produced at will. Each number in the memory is allocated a memory position or address, which can be regarded as a pigeonhole in which the number is stored.

(b) An input mechanism for transferring the information on the input cards into the memory.

(c) An output mechanism for transferring information from the memory onto the output sheet.

(d) An arithmetic unit for carrying out simple basic arithmetical operations such as addition, subtraction, multiplication, and division.

(e) A control unit which organises the calculations, i.e. arranges for the input and output of information and the execution of arithmetic operations in the correct sequence as specified by the instructions.

The set of instructions for solving a problem is called the program. When solving a problem the computer begins by storing the complete program in its memory via the input mechanism. It then proceeds to obey the instructions in an order determined by the program. The basic instructions which a machine obeys are usually very simple.
For instance it will need at least one, and perhaps several, basic orders to execute the operation "Add the two numbers contained in positions a and b in the memory and place the result in memory position c".

The reader will appreciate that to program a problem in terms of "basic machine language" would require considerable background knowledge of individual machines. It would also be very laborious. Fortunately there is an increasing tendency to produce systems by means of which a programmer can write his program in a relatively simple language resembling ordinary algebra for the formulae and ordinary terminology for the instructions. This simple language must of course be interpreted in terms of, or translated into, basic machine language before the computer can solve any given problem, but this interpretation or translation can be done by the machine itself using a "compiler". The programmer need only prepare his program in the simple language. We shall describe a simple system of this type, devised for teaching purposes.

§ 3.2. Simple programs. Consider the evaluation of

i = 1/√{R² + [2πfL − 1/(2πfC)]²}   (3.1)

for various sets of numbers R, L, C, f. It will be assumed that each set of R, L, C, f is punched on a separate input card. A program for this calculation can be written in the form of five statements, as given in Program 3.1. (Instead of "statement" the words "instruction" or "order" can be used, if preferred.)

PROGRAM 3.1

Label   Statement
9       Read R, L, C, f
        H = 6.2832fL − 1/(6.2832fC)
        i = 1/(R² + H²)^½
        Print i, R, L, C, f
        Go to 9

The first statement makes the computer read the first card containing input data. Four numbers are on this card in the form of four sets of punched holes. The machine associates four memory positions with the symbols R, L, C, f and stores the four numbers from the input card in the corresponding memory positions.
From now on, instead of saying "The number in the memory position associated with the symbol R" we shall say "The number associated with R" or, shorter still, "The number R". The second statement says "Perform the calculation on the right-hand side of the equation, using the numbers just associated with L, C, f. Store the result in a new memory position associated with the symbol H". The third statement has a similar meaning: it computes and stores i. These two statements could of course be written as one single statement, but it is slightly clearer to write them separately. The fourth statement makes the machine print the computed value of i and the values of R, L, C, f from which i has been derived. The last statement transfers control to the statement numbered 9; in other words it instructs the computer "Go to the statement 9 and do what it says; then perform the statement following 9, and so on". In Program 3.1 this means that the computer reads the next input card, which contains four new values of R, L, C, f. The computer replaces the old values in its memory by the new values and repeats the sequence of operations. The old values are lost completely when they are replaced by new values. Instructions concerning numerical calculations, like the second and third statements in Program 3.1, are called arithmetic statements. The other instructions, which organise the numerical calculations, are called control statements. The arithmetic statement

    a = (x + 2·431)/y    (3.2)

says "Compute the number a by adding 2·431 to x and dividing by y". Hence x and y must have occurred previously in any program in which this statement occurs, since the machine must go to memory positions already allocated to the symbols x and y in order to find the numbers represented by these symbols. The precise meaning of the statement depends on whether the symbol a has occurred previously in the program.
If a has occurred previously there will be a memory position already associated with a. The meaning of (3.2) is then "Obliterate the number in the memory position associated with a and replace it by the number computed from the right-hand side of the equation". If a has not occurred previously the meaning is "Allocate a memory position to the new symbol a and place in this memory position the number computed from the right-hand side of the equation". It is clear that, although arithmetic statements resemble ordinary algebraic formulae in appearance, the "=" sign is used in a different sense. No confusion can arise from this because programs contain only statements (i.e. instructions), never algebraic formulae. The rules governing arithmetic statements can be summarised as follows:

(i) It is essential to have only one symbol on the left-hand side of a statement. (For instance (3.2) cannot be written ay = x + 2·431.) This symbol may have suffixes, e.g. f_1, f_r, a_rj. If the suffix r is a variable parameter, for example if r = 1, 2, ..., n, the value of r must have been specified earlier in the program.

(ii) Any symbol on the right-hand side of an arithmetic statement must have occurred previously in the program. (Incidentally this is why we have written 6·2832 in Program 3.1 instead of 2π, since the symbol π has not been defined in the program. We could of course have written π = 3·1416 as the second statement, and then the third statement could have been H = 2πfL - 1/(2πfC).)

(iii) Apart from (i) and (ii) any arrangement of symbols or notation which is self-explanatory according to the ordinary rules for writing and interpreting algebraic formulae will be allowed.

We next consider an extension of Program 3.1.
Suppose that for any given set of R, L, C we wish the machine to compute first the value of f for which i is a maximum, namely, from (3.1),

    F = 1/{2π√(LC)},

and then evaluate i for f_r = rh, r = 1(1)n, h = F/n, where n is a given integer. The notation r = 1(1)n, which we have already met in Ex. 1.4, means that r starts at unity and increases by steps of unity until n is reached. A suitable set of instructions is given in Program 3.2 which contains only one new type of statement in addition to those in Program 3.1. Programs 3.2(a) and 3.2(b) compute the same quantities. We first of all consider 3.2(a). The interpretation of the first three statements has been covered by the discussion of Program 3.1. The fourth statement

    Do up to 3 for r = 1(1)n

instructs the machine "Perform all statements which immediately follow, up to and including the statement numbered 3, n times, the first time for r = 1, the second time for r = 2, and so on, and the last time for r = n. Then go on to the statement following statement number 3." The first time round, the machine therefore computes and stores f_1, H_1, i_1, the second time f_2, H_2, i_2, and so on until finally f_n, H_n, i_n are computed and stored. When this is done the computer will print the f_s and i_s (s = 1 to n) and R, L, C. The part r = 1(1)n of the "Do up to" statement obeys rules which are similar to those governing an arithmetic statement, namely that the symbol on the left may or may not have occurred previously in the program, but any symbols on the right must have occurred previously in the program. Compared with Program 3.1 the "Print" order in Programs 3.2(a), (b) has been extended to print a subscripted variable with the range of suffixes clearly indicated. Commas are used to separate independent variables but the word "and" is used to separate subscripted variables with the same range.
The numbers "9" and "3" in the "Label" column of Program 3.2(a) are simply identifying tags attached to certain statements, so that these statements can be referred to in control statements such as "Go to" and "Do up to" in this example. For labels we use positive integers. These need not be in increasing order of magnitude. As labels, instead of numbers we could equally well have used Greek or Roman letters.

PROGRAM 3.2

(a)

Label   Statement
9       Read R, L, C, n
        F = 0·15915/√(LC)
        h = F/n
        Do up to 3 for r = 1(1)n
        f_r = rh
        H_r = 6·2832 f_r L - 1/(6·2832 f_r C)
3       i_r = 1/√(R^2 + H_r^2)
        Print R, L, C, f_s and i_s for s = 1(1)n
        Go to 9

(b)

Label   Statement
9       Read R, L, C, n
        F = 0·15915/√(LC)
        h = F/n
        f = 0
        Do up to 3 for r = 1(1)n
        f = f + h
        H = 6·2832 fL - 1/(6·2832 fC)
3       i_r = 1/√(R^2 + H^2)
        Print R, L, C, F, h, i_s for s = 1(1)n
        Go to 9

In Program 3.2(a) the machine stores H_r (r = 1 to n) although these are used only for intermediate calculations, and are not required after the corresponding i_r have been found. Storage space can be saved by arranging the computation as in Program 3.2(b) where for illustrative purposes we have decided that storage and printing of all the f_r (r = 1 to n) is also unnecessary. Consider the statement

    f = f + h

This means "Take the number in the memory position associated with f and add h to it. Place the result in the memory position associated with f". The old number which was in f is overwritten. Any subsequent reference to f applies to the new number which has just been stored in the memory position associated with f. Program 3.2(b) should now be self-explanatory. Suppose finally that in addition to the calculation just considered we wish to evaluate i_r for f_r = rh, r = n+1, n+2, ..., N, where N is defined by the inequalities

    i_N < p/R ≤ i_{N-1}  (p < 1, given).    (3.3)
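Program 3.2(b), together with the search for the number N just defined, might be sketched in Python as follows (the function name and the returned list are my own conventions; the "cards" of the original become function arguments):

```python
import math

def sweep(R, L, C, n, p):
    """Program 3.2(b) plus the determination of N: evaluate i at
    f = rh for r = 1..n (h = F/n, F the resonant frequency), then
    continue past the peak until i falls below p/R."""
    F = 0.15915 / math.sqrt(L * C)   # 0.15915 is 1/(2*pi) to five figures
    h = F / n
    f = 0.0
    i = []
    for r in range(1, n + 1):        # the "Do up to 3" loop; f overwrites itself
        f = f + h
        H = 2 * math.pi * f * L - 1 / (2 * math.pi * f * C)
        i.append(1 / math.sqrt(R * R + H * H))
    s, g = n, n * h                   # second half: s, g instead of r, f
    while i[-1] >= p / R:             # the Yes-No decision
        s = s + 1
        g = g + h
        H = 2 * math.pi * g * L - 1 / (2 * math.pi * g * C)
        i.append(1 / math.sqrt(R * R + H * H))
    return s, i                       # s is the number N of (3.3)
```

The `while` loop has no predetermined length, which is exactly why N cannot be known before the calculation is run.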
The maximum value of i_r is 1/R, and i_r is monotonically decreasing for r > n, so that we wish to compute i_r for increasing r until the value of i_r falls below some prescribed fraction of the maximum value of i. This can be done as in Program 3.3 where the first eight statements are the same as in Program 3.2(b) apart from the addition of a parameter to the "Read" statement. The point of this example is that the number N in (3.3) is not known beforehand. The machine has to determine N by computing i_s for s = n+1, n+2, ..., each time comparing the current value of i_s with p/R until i_s < p/R. This procedure is controlled by an "If" statement:

    If i_s ≥ p/R go to 14

which instructs the machine: "If i_s ≥ p/R transfer control to statement 14 as in a 'Go to' statement. If i_s < p/R perform the statement immediately following the 'If' statement, namely 'Print ...'." In the second half of Program 3.3 we have used s, g instead of r, f. This is not essential but it emphasises that the program consists of two distinct parts. The value of s which is printed is the value assigned to the variable s when the "Print" statement is reached, and this is the same as the number N in (3.3). The possibility of programming Yes-No decisions by means of "If" statements is one of the distinctive features of performing calculations by means of an automatic computer. The "If" statement allows the machine to control its own calculations. From our point of view the main difficulty in programming does not lie in the arithmetical calculations themselves but in the organisation of the sequence of calculations. It is often helpful to draw a flow-diagram† to clarify the logical structure of the program to be used. This is illustrated for Program 3.3 in Fig. 3.1 which should be self-explanatory. A flow-diagram consists of boxes connected by lines. The boxes are essentially of two types.
A test box or If-box has one input line and two output lines. The output line which is selected depends on whether the answer to the question in the box is "yes" or "no". Any other kind of control or arithmetic statement is placed in an assertion box. It is convenient to make test boxes oval and assertion boxes rectangular, as in Fig. 3.1. In contrast to the relatively limited kinds of statement used in programs, any kind of statement is allowed in flow-diagrams as long as the meaning is clear. A flow-diagram tends to be descriptive but the logical interconnections are described precisely. Flow-diagrams can vary from description in broad outline to description in detail. At the one extreme the flow-diagram will consist of almost equal numbers of assertion and test boxes, with little detail in the assertion boxes. At the other extreme the programmer can convert the flow-diagram into a program directly by merely removing the lines and boxes, and inserting labels and "Go to" statements as necessary. In the flow-diagram in Fig. 3.1 the control instructions are given in detail but no arithmetic information is given.

† An excellent collection of flow-diagrams (or flow-charts) is contained in Mathematical Methods for Digital Computers, A. Ralston and H. S. Wilf, Wiley (1960).

PROGRAM 3.3

Label   Statement
9       Read R, L, C, n, p
        F = 0·15915/√(LC)
        h = F/n
        f = 0
        Do up to 3 for r = 1(1)n
        f = f + h
        H = 6·2832 fL - 1/(6·2832 fC)
3       i_r = 1/√(R^2 + H^2)
        s = n
        g = nh
14      s = s + 1
        g = g + h
        H = 6·2832 gL - 1/(6·2832 gC)
        i_s = 1/√(R^2 + H^2)
        If i_s ≥ p/R go to 14
        Print s, i_t for t = 1(1)s, R, L, C, n, p
        Go to 9

[Fig. 3.1 Flow-diagram for Program 3.3: Start → "Is another input card available?" (No: stop machine) → "Read in the next values of R, L, C, n, p" → "Compute F, h" → "Compute i_r, r = 1(1)n" → "Set s = n" → "Increase s by 1" → "Compute i_s" → "Is i_s ≥ p/R?" (Yes: back to "Increase s by 1") → "Print s, i_t (t = 1 to s), R, L, C, n, p" → back to Start.]

§ 3.3. Some programs involving iterative procedures.
The problems involved in programming an iterative procedure for an automatic computer can be illustrated by considering the solution of equations by means of the Newton-Raphson procedure. From § 2.3, equation (2.10), the equation f(x) = 0 is solved by means of the iteration

    x_{r+1} = x_r - f(x_r)/f'(x_r),

where x_r is the rth approximation to the root x. It is necessary to start by obtaining initial approximations x_0 for each of the roots which we wish to determine. It may be possible to locate the roots roughly by hand calculation and include the resulting estimates in the input data for the machine, but usually it is better to program some procedure by means of which the machine itself can make initial estimates of the roots. The next step is to program the machine to perform the iteration. There are then two possibilities. The procedure may converge satisfactorily, in which case the machine can be made to iterate until the difference between successive iterates is sufficiently small. Alternatively, the procedure may not converge or may converge extremely slowly, in which case it is necessary to alter the iteration in some way or, if desired, stop the iteration altogether. A useful criterion for convergence is the following.† We define e_r = x_{r+1} - x_r. When an iterative procedure is converging towards a root we shall have |e_{r+1}| < |e_r| until the e_r become so small that they are affected by the fact that only a finite number of decimal places are used in the calculation. In simple computations we may find that x_r = x_{r+1} = x_{r+2} so that e_r = e_{r+1} = 0. If the calculations are appreciably affected by round-off error, however, random error fluctuations will eventually give |e_{r+1}| > |e_r| for some r. (Compare the discussion of random error fluctuations in § 2.6.)

† This was pointed out to me by Dr D. J. Wheeler.
In either case we can use the condition |e_{r+1}| ≥ |e_r| as a criterion that the calculation has converged to the degree of accuracy allowed by round-off error. The level of random variations can be regarded as a noise-level limiting the accuracy of the calculation. An even simpler criterion can be used if we know that the x_r vary monotonically with increasing r. Suppose for example that theoretically x_{r+1} > x_r for all r. Since we are dealing with numbers specified to a finite number of decimal places we shall eventually have x_{r+1} ≤ x_r due to either repetition of the iterated value, or round-off error and the noise-level of the calculation. This condition can therefore be taken as a criterion of convergence. Some of these points are illustrated in the following examples.

Ex. 3.1. Program 3.4 gives a straightforward routine for finding the square root of a positive real number a by means of the following formula (Ex. 2.2):

    x_{r+1} = ½{x_r + (a/x_r)}.

If the machine is given a negative value of a by mistake this formula will give nonsensical results. In this case we make the machine stop. The value of a has been printed out, and this will indicate the source of the trouble. (This is the first time we have used the self-explanatory "Stop" order. We might have used instead an "Alarm" order to stop the machine and warn the operator that something has gone wrong with the calculation.) As an initial estimate of the root we take unity, and a single iteration gives

    x_0 = ½(1 + a).

We have

    x_r - √a = (x_{r-1} - √a)^2/(2x_{r-1}),  r = 0, 1, 2, ...,    (3.4)

where for r = 0 we must define x_{-1} = 1. Hence x_r ≥ √a and x_r^2 ≥ a. (This is the reason for choosing x_0 = ½(1 + a). If we chose x_0 = 1 it would no longer be true that x_r^2 ≥ a for all r ≥ 0.) Equation (3.4) proves that the procedure is second-order. From (3.4),

    x_r - √a = ½(1 - √a/x_{r-1})(x_{r-1} - √a).

Hence the procedure is convergent if

    |1 - √a/x_{r-1}| < 1.
PROGRAM 3.4

Label   Statement
        Read a
        If a ≥ 0 go to 12
        Print a
        Stop
12      x = ½(1 + a)
2       u = x
        x = ½(x + (a/x))
        If x < u go to 2
        Print x, a

Since x_{r-1} ≥ √a for all r the procedure is always convergent. We have also

    x_r - x_{r+1} = (x_r^2 - a)/(2x_r) ≥ 0,

so that successive iterates decrease until the procedure converges or reaches the noise-level of the calculation. This means that we can use the condition x_{r+1} ≥ x_r as a criterion for convergence. After the "Print x, a" order the machine goes on to the next part of the program. One disadvantage of Program 3.4 is that if a is very different from unity then the procedure is initially somewhat slowly convergent. This is avoided in Program 3.5 where by successive multiplication or division by 16 we ensure that the iterative procedure is used only on numbers between ¼ and 4. To obtain an accuracy of say 10^-10 it is readily verified that no more than five iterations are necessary, and this fact is used in the program. The parameter r in the "Do" statement is used merely to count the number of iterations and it does not occur in any other statement. Note that the case a = 0 has to be considered separately. One way in which this example is unrealistic is connected with the efficiency of the method. The square root operation will be provided as a standard routine for any machine and it will be used many millions of times per year. If we can save one millisecond in the time taken by the square root program this will save about 17 minutes of computer time for every million square roots that are computed. Such economies can quickly build up into a substantial saving of computer time. Hence it is worth while devoting a considerable amount of effort to developing an efficient square-root program. The basic machine language will be used and advantage will be taken of any special features which allow time to be saved.
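A Python sketch of Program 3.4, using the same stopping rule (return as soon as an iterate fails to decrease). The a = 0 case is split out here, as in Program 3.5, because in floating point the pure iteration would only stop at the underflow level:

```python
def sqrt_newton(a):
    """Program 3.4: Newton's iteration x -> (x + a/x)/2 for sqrt(a),
    started at x0 = (1 + a)/2, stopped when the iterates stop decreasing."""
    if a < 0:
        raise ValueError("negative argument: %g" % a)  # the "Print a; Stop" branch
    if a == 0.0:
        return 0.0          # handled separately, as in Program 3.5
    x = 0.5 * (1 + a)
    while True:
        u = x               # label 2: keep the previous iterate
        x = 0.5 * (x + a / x)
        if x >= u:          # criterion x_{r+1} >= x_r: noise level reached
            return x
```

Because the iterates decrease strictly until the noise-level is reached, the loop is guaranteed to terminate in finite-precision arithmetic.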
The way in which a computation is programmed will then depend to a large extent on the way in which calculations are performed by the arithmetic unit inside the machine. Thus if all numbers inside the machine must be less than unity in magnitude and 0 < a < 1 we cannot calculate ½(1 + a) by first of all forming 1 + a, although ½ + ½a would be permissible. However, considerations of this kind lie outside the scope of this book.

PROGRAM 3.5

Label   Statement
        Read a
        Print a
        If a ≥ 0 go to 12
        Stop
12      k = 0
        If a > 4 go to 3
        If a < ¼ go to 4
6       u = ½(1 + a)
        Do up to 5 for r = 0(1)4
5       u = ½(u + (a/u))
        Go to 8
3       a = a/16
        k = k + 1
        If a > 4 go to 3
        Go to 6
4       If a > 0 go to 7
        x = 0
        Go to 9
7       a = 16a
        k = k - 1
        If a < ¼ go to 7
        Go to 6
8       x = 4^k u
        a = 16^k a
9       Print x, a

Ex. 3.2. Draw a flow-diagram for Program 3.5.

Ex. 3.3. Find the real roots of

    P_n(x) = a_0 x^n + a_1 x^{n-1} + ... + a_n = 0.

In this example we shall use x_1, x_2, ..., x_m to denote the real roots of this polynomial. (Elsewhere we have used x_r to denote the rth approximation to a root x.) Suppose that the following procedure is used. We first of all find the root of smallest absolute value, x_1, and then divide P_n(x) by x - x_1, giving a polynomial P_{n-1}(x) of degree n - 1. We then find the next smallest root x_2 and divide P_{n-1}(x) by x - x_2, obtaining P_{n-2}(x), and so on. This sequence is repeated until we obtain x_1, x_2, ..., x_m and P_{n-m}(x), where this final polynomial has only complex roots. It is found that this method is very satisfactory numerically if the well-conditioned roots are found and removed first. It is for this reason that the roots are found in the order of their size, starting with the smallest, since the smaller roots tend to be better conditioned than the larger. This has already been mentioned in connection with (2.24).
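The division of P_n(x) by x - x_1 described in Ex. 3.3 is a synthetic division. A sketch in Python, assuming (my convention) that the coefficients are held in a list a running from a_0 to a_n:

```python
def deflate(a, root):
    """Divide a[0]*x^n + ... + a[n] by (x - root) using synthetic
    division; returns the n coefficients of the degree n-1 quotient.
    The remainder, which equals P(root) and is ~0 when root is
    accurate, is discarded."""
    b = [a[0]]
    for coef in a[1:-1]:
        b.append(coef + b[-1] * root)
    return b
```

For example, deflating x^2 - 3x + 2 by its root 1 leaves the factor x - 2.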
When finding roots by factor-removal methods it might seem desirable to try to avoid error build-up, due to errors in the roots which are found first, by performing final iterations on the original polynomial. However, this can be inconvenient and it is not usually necessary if the well-conditioned roots are found and removed first, since a badly conditioned root is difficult to determine accurately whether we iterate on the original polynomial or on a reduced polynomial. The roots will be found by means of the Newton-Raphson procedure. This means that it is necessary to estimate first approximations for the roots. For the first approximation to x_r (r > 1), the smallest root of P_{n-r+1}(x), we shall choose x_{r-1}, the value of the smallest root of P_{n-r+2}(x) which has just been found at the previous stage of the calculation. This is a reasonable procedure since if we find the smallest root at the first stage of the calculation we are likely to find the roots in increasing order of size. Also in the case where there are two roots close together, which is a troublesome case, the first of the two roots will give a close initial estimate for determination of the second. To start the procedure we need an estimate of x_1, which we shall denote by (x_1)_0. This will be found in the following way. If |a_{n-1}| ≥ |a_n| we set (x_1)_0 = -a_n/a_{n-1}; otherwise we set (x_1)_0 = 1. To understand why we choose (x_1)_0 in this way we recall that (§ 2.5 (iii))

    -a_{n-1}/a_n = 1/x_1 + 1/x_2 + ... + 1/x_n.

A large value of |a_{n-1}/a_n| indicates that at least one of the x_j is small, and it is reasonable to estimate the smallest root by choosing (x_1)_0 = -a_n/a_{n-1}. On the other hand a small value of |a_{n-1}/a_n| does not necessarily indicate that the x_j are large, since cancellation may occur. A simple example is x^2 + 0·0001x - 1 = 0, where -a_n/a_{n-1} = 10,000 but the roots are of order unity. It would obviously be unreasonable to take (x_1)_0 = -a_n/a_{n-1} in this case.
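The rule for (x_1)_0 is easily written down. A sketch with the same coefficient convention (a_0 first):

```python
def initial_estimate(a):
    """First guess (x1)_0 for the smallest root of
    a[0]*x^n + ... + a[n] = 0: take -a_n/a_{n-1} when
    |a_{n-1}| >= |a_n|, and 1 otherwise."""
    if abs(a[-2]) >= abs(a[-1]):
        return -a[-1] / a[-2]
    return 1.0
```

For the cancellation example x^2 + 0·0001x - 1 = 0 of the text this returns 1 rather than 10,000, as intended.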
If we choose (x_1)_0 = 1 when |a_{n-1}| < |a_n| then if the smallest root is much greater than unity the Newton-Raphson procedure should still converge to it, though initially the rate of convergence may be slow; and the procedure will converge to a small root if this exists and |a_{n-1}/a_n| is small merely because cancellation has occurred. The problem of how to define (x_1)_0 illustrates one of the difficulties of automatic computation. The program must tell the machine exactly what to do. Ideally we should like to be certain that the value which we tell the machine to take as (x_1)_0 is a reasonable approximation to the smallest root. But we can always invent examples for which any specific criterion fails. The percentage of failures can be reduced by giving the machine more complicated rules for the determination of (x_1)_0. We have to compromise between the difficulty of programming more complicated procedures, and the desire to ensure that (x_1)_0 is a good approximation to the smallest root of the polynomial. The suggested procedure will not always locate all the real roots. This is partly because we are writing a general program to solve a wide class of problems. It is not surprising that we can invent examples for which the program will not work. On the whole, the more specific the problem the easier it is to write a program such that it will be virtually certain that an automatic computer will obtain the correct result. Program 3.6 gives one method for carrying out the previously described procedure for finding the real roots of a polynomial. A flow-diagram for this program is given in Fig. 3.2. Starting at the beginning of the program, when statement 3 is reached the value (x_1)_0 defined above has been assigned to a variable u. After defining the values of various symbols the basic iterative procedure for finding the correction δ is performed in statements 7 to 11, using the method and notation described in some detail in § 2.5.
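The correction δ of statements 7 to 11 comes from two nested Horner (synthetic-division) recurrences, the b_m giving P(u) and the c_m giving P'(u). A Python sketch (the function name is mine):

```python
def newton_correction(a, u):
    """One Newton-Raphson correction delta = -P(u)/P'(u) for the
    polynomial a[0]*x^m + ... + a[m], via the recurrences
    b_m = a_m + b_{m-1}*u (so the last b is P(u)) and
    c_m = b_m + c_{m-1}*u (so the last c is P'(u))."""
    b = a[0]
    c = a[0]
    for coef in a[1:-1]:
        b = coef + b * u
        c = b + c * u
    b = a[-1] + b * u          # b is now P(u); c is P'(u)
    return -b / c
```

Applied to x^2 - 2 at u = 1·5 this gives δ = -0·25/3, i.e. the familiar Newton step of Ex. 3.1 towards √2.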
PROGRAM 3.6

Label   Statement
        Read a_i for i = 0(1)n
        Print a_i for i = 0(1)n
        u = 1
        If |a_{n-1}| < |a_n| go to 3
        u = -a_n/a_{n-1}
3       b_0 = a_0
        c_0 = a_0
        j = 1
5       δ = 0
        r = 0
        Go to 7
6       ε = δ
7       u = u + δ
        Do up to 10 for m = 1(1)n-j+1
        b_m = a_m + b_{m-1}u
10      c_m = b_m + c_{m-1}u
11      δ = -b_{n-j+1}/c_{n-j}
        If |δ/u| < 10^-4 go to 12
        r = r + 1
        If r < 20 go to 6
        Print u, δ, a_i for i = 0(1)n-j+1
20      Stop
12      If r = 0 go to 6
        If |δ| < |ε| go to 6
        Print u, δ
        If j = n go to 20
        j = j + 1
        Do up to 15 for m = 1(1)n-j+1
15      a_m = b_m
        Go to 5

If |δ/u| > 10^-4 it is assumed that the process has not yet started to converge and the machine repeats the iteration. (The figure 10^-4 is chosen arbitrarily as a figure much larger than the accuracy of the calculation, which is, say, 1 in 10^9, but small enough so that when |δ/u| < 10^-4 we can be reasonably certain that the procedure is converging. Depending on circumstances it may be more convenient to work in terms of absolute instead of relative values, say |δ| < 10^-4 indicates convergence.) If |δ/u| < 10^-4 on the first iteration the program defines a variable ε = δ and then repeats the iteration. In general the program sets ε equal to the old value of δ before starting an iteration to find a new δ. If the iteration is converging normally we have |δ| < |ε|. We assume that when, after a period of normal convergence, |δ| ≥ |ε|, the process has converged and the root has been determined to the maximum possible accuracy for the method used. The current values of u and δ are printed out, where δ is the next correction to u and gives some idea of the "noise-level" of the calculation. (This idea could be extended by asking the machine to perform say four iterations beyond the point where |δ| ≥ |ε|, and record the maximum and minimum values
of u. Note that it is a better check on the accuracy of the calculation to print out δ than to print out the residuals, which may be small even though the roots are inaccurate, as shown in § 2.6.)

[Fig. 3.2 Flow-diagram for Program 3.6: read and print a_i (i = 0 to n); set u = 1 or u = -a_n/a_{n-1}; set b_0 = c_0 = a_0, j = 1; set r = 0; compute the b_m, c_m and δ = -b_{n-j+1}/c_{n-j}; test "Is |δ/u| < 10^-4?", increasing r and allowing up to 20 iterations before printing u, δ, a_i and stopping; test "Is r = 0?" and "Is |δ| < |ε|?"; print u, δ; if j = n stop, otherwise replace a_m by b_m, increase j by 1, and repeat.]

Once the process has converged the program tests to see if j = n, which indicates that the polynomial has real roots, all of which have been found. If this is the case the calculation is stopped. Otherwise the new polynomial P_{n-j}(x) is set in place of P_{n-j+1}(x) and the iteration is repeated. The variable u is already set up as the root of P_{n-j+1}(x) which has just been found, and this is the appropriate first approximation to the smallest root of P_{n-j}(x).

§ 3.4. General comments. We shall illustrate briefly some differences between automatic and hand computing by considering the problem of finding the real roots of a polynomial, assuming that the program of Ex. 3.3 is used for the automatic computer and a similar method is used for the computation by hand.

(i) In hand computing a convenient method for locating the roots is to evaluate f(∞), f(0), f(-∞) and then f(x) for suitable intermediate x. We should probably draw a rough graph of y = f(x). Although it is not impossible to program this type of procedure we have preferred a simpler method for the automatic computer program.

(ii) In hand computing, when calculating the smallest root of each of the polynomials P_{n-j}(x) we would usually try to obtain an accurate first approximation for this root rather than merely taking the root of P_{n-j+1}(x) found at the previous stage of the calculation.

(iii) In hand computing we can control the number of decimal places used in the calculation.
If the final answer is required to a large number of decimal places we would use a comparatively small number of decimal places in the early stages but increase the accuracy as the procedure converges. There is no point in doing this with an automatic computer.

(iv) In hand computing we would decide the number of decimal places of accuracy required in the final result and adjust the calculation accordingly. In automatic computing it is simpler to work to the full accuracy available in the machine and determine the "noise-level" for each root.

(v) In hand computing we would check at the end of the calculation that we have located all the real roots. We have not incorporated any such check in our program. When using a computer it might be simpler to merely calculate all the roots, both real and complex; then we should be sure that all the real roots have been determined.

(vi) In hand computing a convenient check is to divide the final polynomial with complex roots into the original polynomial, which should give zero remainder, though there will in fact be a small remainder from inaccuracies due to rounding-off error. We can then check that the sum and product of the known roots agree with the appropriate coefficients of the dividend. In automatic computing it is probably unnecessary to apply this check.

The following are some general comments. On the whole, in programming for an automatic computer we prefer a numerical method which involves a simple repetitive procedure, even though this is repeated a large number of times, to a method involving a complicated procedure which is repeated only a few times. The reverse is often true in hand computing. In hand computation the total number of operations involved is usually comparatively small and the effect of round-off errors can usually be made negligible merely by keeping one or two extra decimal places.
In automatic computing the calculations tend to be large in scale and the number of arithmetical operations may be enormous. It may be much more difficult to estimate the effects of errors of various kinds and it may be important to use methods which minimise certain kinds of error. The possibility of an automatic computer making arithmetical mistakes can be ignored when using systems of the type described in this chapter. In this respect computation by an automatic machine is very different from hand computation. On the other hand errors such as loss of significant figures, caused by subtracting two large nearly equal numbers to give a small one, may be much more serious than in hand computation since they may easily pass undetected. It is most important to bear in mind that we have no information about what goes on in an automatic computer between the input figures and the figures which are printed out. Whereas in hand computation we often recognise loss of accuracy or misleading results due to unsuitable methods while doing the calculation, an automatic computer will not do this for us unless we realise beforehand that this may occur and allow for it in the program. Other aspects of the suitability of methods for automatic computers will be mentioned in connection with the various topics dealt with later in this book. The main objects of describing the simplified programming scheme in this chapter are threefold.

(a) To give the student some idea of how a program is organised for an automatic computer. The machine must be told in detail what to do, but it can control its own calculations to a limited extent.

(b) To give the student a programming system by means of which he can write his own programs for the standard methods used in numerical analysis. We have already illustrated this in connection with the Newton-Raphson procedure and further examples will occur throughout this book.
(c) To give the student some understanding of the differences between hand computing and machine computing, and of the criteria which govern whether a method is suitable for automatic computing.

However, the reader should not be misled by the simplicity of the programming system and programs in this book. One of the major preoccupations of anyone using an automatic computer is the efficiency of the procedure used, since computer time is expensive. In practice full advantage is taken of special features of a computer and its programming system in order to devise the easiest and cheapest method for doing a particular problem on a particular machine. These factors often over-ride purely mathematical considerations. They lie outside the scope of this book.

Examples III

Ex. 3.4. It is required to find a real root of an equation f(x) = 0 by means of the following procedure. Two numbers a and b are given such that f(a) < 0, f(b) > 0. We define sequences x_r, y_r by the equations x_0 = a, y_0 = b, and

    x_{r+1} = ½(x_r + y_r),  y_{r+1} = y_r,        if f{½(x_r + y_r)} ≤ 0,
    x_{r+1} = x_r,  y_{r+1} = ½(x_r + y_r),        if f{½(x_r + y_r)} > 0.

(i) Write a program to determine a root to within 10^-m, where m is an integer specified in the input data, and it is assumed that the number of significant figures used in the machine will allow determination of the root to the required degree of accuracy.

(ii) Write a program to determine the root to the maximum number of significant digits allowed by the accuracy of the evaluation of f(x), assuming that this is limited by round-off error.

(Assume that "Read f(x)" means that a set of instructions is given to the machine for evaluating f(x). Whenever f(y), for instance, occurs on the right of an arithmetic statement the machine will evaluate f(x) for x = y and use the resulting number in the appropriate place in the arithmetic statement.)

Ex. 3.5. Write a program for finding arc tan x by the sequential procedure given in Ex.
2.14.

Ex. 3.6. Write a program (or a flow-diagram) for computing sin x and cos x for any value of x, −∞ < x < ∞, given that for 0 ≤ u ≤ π/6

sin u = u(a_1 + a_2 u² + a_3 u⁴ + a_4 u⁶),   cos u = (b_1 + b_2 u² + b_3 u⁴ + b_4 u⁶)^(−1),

where the a_i (i = 1 to 4) and b_i (i = 1 to 4) are given constants. For π/6 < 2v ≤ π/3 use the relations

cos 2v = 2 cos² v − 1,   sin 2v = cos(½π − 2v).

(The formulae for sin, cos are discussed in A. Ralston and H. S. Wilf, Mathematical Methods for Digital Computers, Wiley (1960), p. 23.)

Ex. 3.7. Write a program for evaluating the n smallest positive roots of the equation cot x = kx by an iterative procedure for k = a(b)c (cf. Ex. 2.9). Outline briefly some of the principal differences between the way in which this calculation could be performed on an automatic computer, and the way in which it could be performed by desk calculator.

(In addition to the points in §3.4: for hand computation we should use mathematical tables for cot x, whereas in a computer functions need to be calculated. This may influence the choice of method. In hand computing we should check by differences as in §7.3 below.)

CHAPTER IV

SIMULTANEOUS LINEAR ALGEBRAIC EQUATIONS

§4.1. Introduction. Very many problems in applied mathematics can be reduced to the solution of a set of simultaneous linear algebraic equations. It is therefore of great importance to have efficient methods for the numerical solution of such systems. If we consider the equations

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1,
..........................................        (4.1)
a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n,

the solution can be written, by Cramer's rule,†

x_j = Δ_j/Δ,   (j = 1 to n),        (4.2)

where

Δ = | a_11 ... a_1n |
    | ............. |
    | a_n1 ... a_nn |

and Δ_j is Δ with the jth column replaced by the column of b's. This solution is useful for certain kinds of problem, but it is seldom used in numerical work when the coefficients are pure numbers, since much more efficient methods are available for this case.
As a general rule the direct evaluation of determinants is laborious and inefficient, and should be avoided wherever possible in numerical work.

† See A. C. Aitken, Determinants and Matrices, Oliver and Boyd (1962), p. 55.

The above solution is of theoretical interest. It breaks down when Δ, the determinant of the coefficients, is zero. In this case we may have either a multiplicity of solutions (when the Δ_j are also zero) or inconsistent equations with no solution (where at least one of the Δ_j is non-zero, so that if we imagine a limiting procedure in which Δ tends to zero, the formula (4.2) gives apparently infinite x_j). We need not go into detail, since it will be assumed, when solving simultaneous linear equations, that the determinant of the coefficients is non-zero. It can be anticipated, however, that difficulties will arise if Δ is nearly equal to zero (see §4.4 below).

If the b_i in (4.1) are zero then the Δ_j in (4.2) are zero, and non-zero x_j exist only if Δ = 0. This case is of importance in Chapter VI.

It is convenient to distinguish between direct and iterative methods of solution of linear equations. The amount of computation involved in direct methods can be specified in advance, whereas an iterative computation is repeated until the error is less than the required amount. Direct methods are discussed in this chapter and the next. Iterative methods are discussed in §6.4.

When dealing with numerical examples we shall omit the x_j and the addition and equality signs in (4.1) and write merely an array of numbers:

a_11  a_12  ...  a_1n  b_1
..........................
a_n1  a_n2  ...  a_nn  b_n

When working to a fixed number of decimal places it is usually essential to arrange that the largest element in each row and column of this array is of order unity. This can be done by multiplying the original equations by constants and introducing suitable (preferably simple) multiples of the unknowns. We return to this point later (§4.3).
If the original coefficients are such that a_ij = a_ji (i.e. if the a_ij are symmetrical about the principal diagonal a_11, a_22, ..., a_nn) the scaling should be arranged to preserve the symmetry.

Ex. 4.1. Scale the equations

  80X_1 +      5X_2 −     0.2X_3 =  2.8,
   5X_1 −    0.2X_2 +  0.0125X_3 = −0.0875,
−0.2X_1 + 0.0125X_2 + 0.00625X_3 = −0.025.

The scaling factors that should be used are more or less arbitrary. If, for example, we introduce x_1 = 40X_1, x_2 = 2X_2, x_3 = 0.5X_3, and multiply the equations by 1, 20, 80 respectively, we obtain

   2x_1 + 2.5x_2 − 0.4x_3 =  2.8,
 2.5x_1 −   2x_2 + 0.5x_3 = −1.75,        (4.3)
−0.4x_1 + 0.5x_2 +   x_3 = −2.

§4.2. The method of successive elimination. The method of successive elimination is well known in elementary algebra. Thus to solve (4.3) we can subtract 1.25 times the first equation from the second, and add 0.2 times the first equation to the third. This gives two equations in x_2, x_3. Similarly x_2 can be eliminated from these two equations to give x_3. Then back-substitution in previous equations yields x_2, x_1 in succession.

The object of this section is to describe a tabular procedure, with frequent checks, for the practical application of this method when the coefficients are given decimal numbers. The calculation and the checks can be arranged in the tabular form given below. Although this is written out only for a set of four equations in four unknowns, the procedure in the general case should be clear. The square brackets will be explained later in Ex. 5.4. The numbers in round brackets are used for checks.

       a_11  a_12  a_13  a_14   b_1   (s_1)        (4.4)
l_21  [a_21  a_22  a_23  a_24   b_2   (s_2)]
l_31  [a_31  a_32  a_33  a_34   b_3   (s_3)]
l_41  [a_41  a_42  a_43  a_44   b_4   (s_4)]
      (S_1)
(S_2) (S_3) (S_4)   (S_5)   (S = s)

       b_22  b_23  b_24   c_2   (t_2 ≈ T_2)         (4.5)
l_32  [b_32  b_33  b_34   c_3   (t_3 ≈ T_3)]        (4.6)
l_42  [b_42  b_43  b_44   c_4   (t_4 ≈ T_4)]        (4.7)

       c_33  c_34   d_3   (u_3 ≈ U_3)               (4.8)
l_43  [c_43  c_44   d_4   (u_4 ≈ U_4)]              (4.9)

       d_44   e_4   (v_4 ≈ V_4)                     (4.10)

Roots:  x_1  x_2  x_3  x_4   (S'_5 ≈ S_5)

The first step is to form check sums which will be used later:

s_i = a_i1 + a_i2 + a_i3 + a_i4 + b_i,   (i = 1 to 4),
S_j = a_1j + a_2j + a_3j + a_4j,         (j = 1 to 4),
S_5 = b_1 + b_2 + b_3 + b_4,
s = s_1 + s_2 + s_3 + s_4,
S = S_1 + S_2 + S_3 + S_4 + S_5.

We should find s = S exactly, since no round-off errors are involved, and this is the first check on the numerical work. We then calculate

l_21 = a_21/a_11,   l_31 = a_31/a_11,   l_41 = a_41/a_11.

The 4×4 set of equations is reduced to a 3×3 set by eliminating the first unknown:

b_ij = a_ij − l_i1 a_1j,   (i, j = 2, 3, 4),        (4.11)
c_i = b_i − l_i1 b_1,   t_i = s_i − l_i1 s_1.        (4.12)

Note that the steps involved in applying these three formulae are of exactly the same kind, so that the procedure is easily reduced to a routine. This holds also for the similar eliminations which follow later. The b_ij, c_i, t_i are calculated line by line, and at the end of each line we perform a check by calculating

T_i = b_i2 + b_i3 + b_i4 + c_i,   (i = 2, 3, 4),        (4.13)

which should agree with the t_i to within a unit or two in the last significant figure, any difference being due to rounding errors. In the same way we calculate

l_32 = b_32/b_22,   l_42 = b_42/b_22.

The 3×3 set of equations is reduced to a 2×2 set by forming

c_ij = b_ij − l_i2 b_2j,   (i, j = 3, 4),        (4.14)

with similar equations for d_i, u_i, and a check

U_i = c_i3 + c_i4 + d_i,   (i = 3, 4),

which should agree approximately with u_i. The 2×2 set of equations is reduced to a single equation by

d_44 = c_44 − l_43 c_34,

with similar equations for e_4, v_4. Also V_4 = d_44 + e_4.
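The forward elimination just described, carrying the check sums along and comparing t_i with the recomputed row sum T_i after each line, might be sketched in Python as follows; the function name and the check tolerance are illustrative assumptions, not the book's:

```python
def eliminate_with_checks(a, b):
    """Forward elimination in the tabular style of section 4.2.
    Each row carries its right-hand side and its check sum s_i as
    extra columns; after every elimination the carried check sum t
    is compared with the recomputed row sum T."""
    n = len(a)
    rows = [a[i][:] + [b[i], sum(a[i]) + b[i]] for i in range(n)]
    for r in range(n - 1):
        for i in range(r + 1, n):
            l = rows[i][r] / rows[r][r]       # multiplier l_ir
            for j in range(r, n + 2):         # coefficients, b and t columns
                rows[i][j] -= l * rows[r][j]
            t = rows[i][n + 1]                # carried check sum
            T = sum(rows[i][r + 1:n + 1])     # recomputed row sum
            assert abs(t - T) <= 1e-8 * max(1.0, abs(T))
    return rows
```

Applied to the scaled equations (4.3) this produces the triangular array from which back-substitution proceeds.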
At this stage we have reduced the original equations to the system (4.4), (4.5), (4.8), (4.10):

a_11 x_1 + a_12 x_2 + a_13 x_3 + a_14 x_4 = b_1,
           b_22 x_2 + b_23 x_3 + b_24 x_4 = c_2,        (4.16)
                      c_33 x_3 + c_34 x_4 = d_3,
                                 d_44 x_4 = e_4.

The unknowns can now be found by back-substitution. We have

x_4 = e_4/d_44,   x_3 = (d_3 − c_34 x_4)/c_33,  etc.

A check on the final answer is provided by the equation

S'_5 = S_1 x_1 + S_2 x_2 + S_3 x_3 + S_4 x_4.

S'_5 should agree with the previously computed S_5, except possibly for a small difference due to rounding-off errors. However, agreement between S_5 and S'_5 to a given number of decimal places does not necessarily indicate that the x_i are accurate to this number of places (see §4.4).

A group of m sets of equations with the same a_ij but m different right-hand sides b_ik (k = 1 to m) can be solved by including m columns b_ik instead of a single column b_i. Only one check column is needed, obtained by adding all the elements in the corresponding rows.

Ex. 4.2. A numerical example is worked on the following page. An extra figure is carried in the intermediate calculation, but the roots are rounded to four decimal places. The positions of numbers correspond exactly with those given earlier, except for the occurrence of only a single number in the right-hand check columns. Consider, for example, t_3 and T_3 in (4.6). In the numerical calculation, instead of writing down the two numbers separately, we first of all write down t_3 using formula (4.12), which gives

[The worked table of Ex. 4.2, a full page of figures in the original, is too badly damaged by scanning to be reproduced here.]
0.20149. We then calculate T_3 using formula (4.13), and this gives 0.20150. The difference between t_3 and T_3 is permissible. For checking the next stage of the calculation it is necessary to use the exact row sum T_3, not t_3 (why not?), so we amend the recorded value of t_3 by crossing out the last two digits and replacing them by 50.

In this example, since the square array of coefficients a_ij is symmetrical about the diagonal line containing a_11, a_22, a_33, a_44, each of the square arrays derived from these coefficients is symmetrical, apart from small differences due to rounding-off errors. This can be used as a check, or only half the non-diagonal terms need be computed.

§4.3. Choice of pivots and scaling. In our description of the method of successive elimination in the last section, the element in the top left-hand corner of each array plays a very special part. It is called the pivot. The l_ij are obtained by dividing each of the other elements in the same column as the pivot by the pivot. It is obvious that in any given array any non-zero element can be used as the pivot. Suppose that in the a_ij array we choose a_rs as the pivot. The rth row, which contains the pivot, is set aside. We calculate, from the elements in the sth column,

l_is = a_is/a_rs,   (i = 1 to 4, omitting i = r).

Then

b_ij = a_ij − l_is a_rj,

where i = 1 to 4, j = 1 to 4, omitting i = r and j = s, i.e. the row and column containing the pivot. The routine is exactly as before.
In the elimination procedure of §4.2, equations (4.1) with n = 4 were reduced to the form (4.16). It is easy to see that the determinant of coefficients has not been altered, since one determinant is obtained from the other by adding multiples of certain rows to other rows:

Δ = | a_11 a_12 a_13 a_14 |   | a_11 a_12 a_13 a_14 |
    | a_21 a_22 a_23 a_24 | = |  0   b_22 b_23 b_24 | = a_11 b_22 c_33 d_44.        (4.17)
    | a_31 a_32 a_33 a_34 |   |  0    0   c_33 c_34 |
    | a_41 a_42 a_43 a_44 |   |  0    0    0   d_44 |

This is an important result: the determinant of the coefficients of the original equations is given by the product of the pivots. It can be proved in the same way that this result holds however the pivots are chosen, apart possibly from the sign of the determinant.

It is sometimes stated without qualification that the largest element in an array should always be chosen as the pivot, but the importance of this can be over-rated. Before discussing this we consider an example.

Ex. 4.3. Solve the equations in Ex. 4.2 by the method of successive elimination, using the largest element in each array as the pivot.

If we assume that the coefficients of the equations in Exs. 4.2, 4.3 are exact, and solve the equations using a large number of decimal places in the intermediate calculations, we find that the exact roots are

x_1 = −5.3868,  x_2 = 2.8137,  x_3 = 11.5729,  x_4 = −6.3653,        (4.18)

rounded to four decimals. The differences between this exact solution and the results in Exs. 4.2, 4.3 are given in Table 4.1, which shows that, in this particular example, the results obtained by taking the largest pivot at each stage are less accurate than the results obtained by choosing the top left-hand pivot. The reason for this is the smallness of the final pivot in Ex.
4.3, namely 0.00251, which is considerably smaller than 0.00893, the smallest pivot in Ex. 4.2. [The worked elimination table for Ex. 4.3, a full page of figures in the original, is not reproduced here.] The size of the smallest pivot in Ex. 4.3 limits the accuracy of the roots to at most 1 in 500, in general.

The reason for the very small final pivot in Ex. 4.3 is clear from (4.17), which states that the product of the pivots must be a constant. If we insist on choosing as pivot the largest number available at each stage, this tends to lead to smaller pivots than might otherwise be obtained. The logical conclusion of this argument might seem to be that the pivots should be chosen so that they are all nearly equal to each other, and therefore each pivot should be roughly equal to the nth root of the determinant. Apart from the fact that this is impractical, since the determinant is not known at the beginning of the analysis, it is not true that increasing the size of the smallest pivot automatically increases the accuracy.

Table 4.1
Comparison of solutions obtained by the method of successive elimination
(differences from the exact roots (4.18))

                                   x_1       x_2       x_3       x_4
Top left pivots (Ex. 4.2)       −0.0160    0.0087    0.0161   −0.0081
Largest pivots (Ex. 4.3)        −0.0402    0.0213    0.0405   −0.0198
Top left pivots, after scaling  −0.0098    0.0055    0.0099   −0.0052
Largest pivots, after scaling    0.0148   −0.0077   −0.0149    0.0071
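The relation (4.17), that the determinant of the coefficients is the product of the pivots (with a change of sign for each interchange of rows), is easily mechanised. The following Python sketch, which is illustrative and not part of the text, chooses the largest element of each column as pivot:

```python
def det_by_pivots(a):
    """Determinant as the product of the pivots of successive
    elimination, cf. (4.17); each row interchange reverses the sign.
    No safeguard is included against a zero pivot (singular matrix)."""
    a = [row[:] for row in a]
    n = len(a)
    det = 1.0
    for r in range(n):
        p = max(range(r, n), key=lambda i: abs(a[i][r]))
        if p != r:
            a[r], a[p] = a[p], a[r]   # row interchange: sign change
            det = -det
        det *= a[r][r]                # accumulate the pivot
        for i in range(r + 1, n):
            l = a[i][r] / a[r][r]
            for j in range(r, n):
                a[i][j] -= l * a[r][j]
    return det
```

For the scaled equations (4.3) this gives the same value whatever pivot strategy is used, which is precisely why insisting on large pivots early tends to force a small pivot at the end.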
It will be shown in the next section that a certain inherent inaccuracy is present in the roots of a set of simultaneous linear equations if these are found by any procedure involving rounding-off errors. As long as the pivots are large enough so that round-off in the smallest pivot does not imply a relative error greater than the inherent error in the calculation, it would seem that little additional accuracy can be obtained by choosing the pivots in any special way. The essential point is to avoid very small pivots, rather than to arrange that the smallest pivot is as large as possible. On the whole the size of the pivots would seem to be of second-order importance.

In §4.5 below, when programming the method of successive elimination for automatic computers, we shall pivot on the largest element in the left-hand column of numbers in the array obtained after each elimination. This avoids unnecessarily small pivots, and the multipliers are all less than unity, which may be convenient for some computers.

The object of illustrating that the choice of pivots in Ex. 4.2 gives more accurate results than the choice in Ex. 4.3 is not of course to imply that the choice of the top left-hand number as pivot is to be preferred to the choice of the largest pivot at each stage. Many of the equations which occur in practice are such that, when calculating by hand, they are suitable for solution by choosing the top left-hand number as pivot, even though this is not the largest possible pivot. On the other hand, if the top left-hand number is appreciably smaller than the other numbers in an array then it should certainly not be chosen as pivot.

Suppose next that we rescale the equations of Exs. 4.2, 4.3 by multiplying the third equation by two and the fourth equation by four, and introduce new unknowns

x'_1 = x_1,  x'_2 = x_2,  x'_3 = ½x_3,  x'_4 = ¼x_4.

It will be found that if the pivots are chosen as the top left-hand numbers in each array as in Ex.
4.2, the pivots are 0.5400, 0.28009, 0.09396, 0.14290. If the pivots are chosen to be the largest numbers at each stage as in Ex. 4.3, then the pivots are 7.0032, 0.74465, 0.15567, 0.00249.

The differences between the exact results and the results given by the two methods are given in Table 4.1. It is seen that although scaling has improved the results slightly, the improvement is not striking. The smallest pivot when using the top left-hand pivots is now comparatively large, but the smallest pivot obtained by taking the largest pivot at each stage is almost exactly the same as in Ex. 4.3 before rescaling. The point about rescaling is exactly the same as the point previously made about the choice of pivot. Rescaling can be advantageous in avoiding very small pivots, rather than arranging that the smallest pivot should be as large as possible. Once the smallest pivot is above a certain minimum value there is little point in further rescaling. In practice, if it turns out that the smallest pivot is undesirably small, it is usually preferable to improve the solution by the method described in the next section (Ex. 4.4) rather than repeat the solution with a different choice of pivots, rescaled equations, or more decimal places.

§4.4. Inherent error and ill-conditioned equations. This section is concerned with a characteristic difficulty which occurs when solving linear equations. We first consider a concrete example:

1.985x_1 − 1.358x_2 = 2.212,        (4.19)
0.953x_1 − 0.652x_2 = 1.062.

Elimination of x_1 gives

(1.985 × 0.652 − 0.953 × 1.358)x_2 = −(1.985 × 1.062 − 0.953 × 2.212).        (4.20)

On working to six decimal places we find

0.000046x_2 = −0.000034,   x_2 = −0.739....

Similarly x_1 = 0.609.... In quoting the answers to three decimal places we have in fact assumed that the coefficients in (4.19) are correct to seven decimal places and that we have worked to seven decimal places, so that x_2 = −340/460, x_1 = 280/460.
If after evaluating the quantities in brackets in (4.20) we had to round off to five decimal places, we should have

x_2 = −0.00003/0.00005 = −0.6...,

and the answer is not even correct to one decimal place. Round-off imposes a fundamental limitation on the accuracy with which the roots can be obtained.

If instead of (4.19) we solve

1.985x_1 − 1.358x_2 = 2.212,
0.953x_1 − 0.652x_2 = 1.061,

we find x_1 = 30.13..., x_2 = 42.41.... An alteration of one coefficient of (4.19) by one in a thousand (1.062 changed to 1.061) has altered the roots by factors of about fifty to one (0.61 to 30, and −0.74 to 42).

The reason for the behaviour of the above equations is clear. The second equation is nearly a multiple of the first, so that we can write the equations algebraically as

a x_1 + b x_2 = c,
(ka + α)x_1 + (kb + β)x_2 = kc + γ,

where k is a suitable constant and α, β, γ are small. The roots of these equations are

x_1 = (cβ − bγ)/(aβ − bα),   x_2 = (aγ − cα)/(aβ − bα).

The small numbers α, β, γ have a dominant effect on the result. In computing the determinants, large numbers have cancelled by subtraction to give small final results. The determinant of coefficients is small relative to the individual terms in the expansion of the determinant. (Notice that it is meaningless to say simply that the determinant is small; it has to be small relative to some other quantity.)

If X_j are approximate roots of the set of linear equations (4.1), we define the residuals r_i as

r_i = Σ_{j=1}^{n} a_ij X_j − b_i.        (4.21)

For equations (4.19), if we take X_1 = 0.638, X_2 = −0.696 we find r_1 = −0.0004, r_2 = −0.0002. Roots which differ from the true roots by 1 in 20 produce residuals which are less than the round-off error of the coefficients. This illustrates the important point that smallness of the residuals does not necessarily mean that the corresponding estimates of the roots are accurate. Equations like (4.19) are said to be ill-conditioned. The opposite of ill-conditioned is well-conditioned.
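The behaviour of (4.19) is easy to reproduce. In the Python sketch below (illustrative, not from the text), the 2×2 system is solved by the explicit formulas above, first as given and then with 1.062 replaced by 1.061:

```python
def solve2(a11, a12, a21, a22, c1, c2):
    """Solve the 2 x 2 system  a11 x1 + a12 x2 = c1,
    a21 x1 + a22 x2 = c2  by the explicit (Cramer) formulas."""
    det = a11 * a22 - a12 * a21   # small: the equations are ill-conditioned
    x1 = (c1 * a22 - a12 * c2) / det
    x2 = (a11 * c2 - c1 * a21) / det
    return x1, x2

# Equations (4.19) as given ...
x1, x2 = solve2(1.985, -1.358, 0.953, -0.652, 2.212, 1.062)
# ... and with the single right-hand side 1.062 altered to 1.061.
y1, y2 = solve2(1.985, -1.358, 0.953, -0.652, 2.212, 1.061)
```

A change of one part in a thousand in one coefficient moves the roots from about (0.61, −0.74) to about (30.1, 42.4).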
We talk of the "condition" or "conditioning" of a set of equations. As already illustrated, characteristic signs of ill-conditioning are:

(a) To determine the roots to a specified number of decimal places when round-off error is present, it is necessary to work to a much larger number of decimal places in the intermediate calculations.

(b) Small alterations in the coefficients produce large variations in the roots.

(c) The determinant of the coefficients is small, in the sense described above.

(d) Estimates of the roots which differ appreciably from the true roots produce very small residuals.

Two other indications of ill-conditioning depend on ideas which will be introduced later. In terms of the inverse matrix defined in §5.3, equations are ill-conditioned if the elements of the inverse matrix are large in the sense described at the end of Ex. 5.6. In terms of the concept of an eigenvalue defined in Chapter VI, equations are ill-conditioned if the ratio of the largest to smallest eigenvalue of the matrix of coefficients is large. The equations in Ex. 4.2 are slightly ill-conditioned (see also Ex. 5.6).

If the diagonal elements a_11, a_22, ..., a_nn are much greater than the non-diagonal elements a_ij (i ≠ j), then the equations are well-conditioned. If all a_ij (i ≠ j) are zero then the equations are perfectly conditioned. If the coefficients vary in size it is usually convenient to arrange, by interchange of equations and/or scaling, that the largest coefficients lie in or near the diagonal positions a_11, a_22, ..., a_nn.

It is useful to distinguish between two different aspects of accuracy when solving linear equations.

(i) If the coefficients of the equations are assumed to be exact, but a fixed number of decimal places are carried during the solution, how many decimal places are accurate in the answer?
(ii) If the coefficients of the equations are inexact (for example because they are obtained by rounding off more accurate numbers), to how many places are we justified in quoting the final answer?

It is easy to answer both of these questions in any specific case, provided that the detailed solution of the original equations is available and we are prepared to do a little extra work. Suppose that in case (i) the roots obtained by working to a fixed number of decimal places are denoted by X_j, and we define residuals by (4.21) as before. Suppose that the exact roots x_j are related to the approximate roots by x_j = X_j + δx_j, (j = 1 to n). On substituting in (4.1) we readily find that the δx_j are the roots of the equations

Σ_{j=1}^{n} a_ij δx_j = b_i − Σ_{j=1}^{n} a_ij X_j = −r_i,   (i = 1 to n).        (4.22)

The procedure is illustrated in the following example.

Ex. 4.4. Find the accuracy of the roots obtained in Ex. 4.2.

The residuals are found to be:

10^5 r_1 = 0.010,  10^5 r_2 = −5.072,  10^5 r_3 = −2.512,  10^5 r_4 = −2.506.

The major part of the work involved in solving (4.22) has already been carried out in the table in Ex. 4.2, and it is necessary to calculate only one new b_i column. The results are almost exactly those given in Table 4.1:

δx_1 = −0.0160,  δx_2 = 0.0087,  δx_3 = 0.0161,  δx_4 = −0.00805.

Hence the roots in Ex. 4.2 are accurate to within one or two digits in the second decimal place.

In order to answer the question in (ii) above, suppose that possible errors in a_ij, b_i are denoted by δa_ij, δb_i, and suppose that these alter the roots from x_j to x_j + δx_j. Then

Σ_{j=1}^{n} (a_ij + δa_ij)(x_j + δx_j) = b_i + δb_i.

On neglecting second-order terms we find

Σ_{j=1}^{n} a_ij δx_j = δb_i − Σ_{j=1}^{n} δa_ij x_j,   (i = 1 to n).

If x_j = X_j + ε_j, where the X_j are approximate roots, we obtain, again neglecting second-order terms,

Σ_{j=1}^{n} a_ij δx_j = δb_i − Σ_{j=1}^{n} δa_ij X_j = ρ_i,
say.        (4.23)

In many cases we can regard a solution as satisfactory if the error in the solution is comparable with the inherent error due to inexact coefficients. From (4.22) and (4.23) it is seen that a solution is satisfactory in this sense if the residuals |r_i| are less than some average value (such as the root-mean-square value) of the magnitude of the quantities ρ_i deduced from the possible errors in the coefficients.

As an example consider the results in Ex. 4.2. The maximum residual has magnitude 5×10^(−5) (Ex. 4.4). The possible errors in the coefficients, each of which may be as large as 5×10^(−5), can produce quantities ρ_i of magnitude up to roughly

5×10^(−5) (1 + |x_1| + |x_2| + |x_3| + |x_4|) ≈ 125×10^(−5).

Statistically the error will be considerably less than this, but it is unlikely that it will be reduced by the factor of 25 required to make the average |ρ_i| comparable with the largest residual. Hence the solution obtained in Ex. 4.2 is satisfactory in the sense defined at the beginning of this paragraph. This type of result would seem to be the rule rather than the exception when the method of successive elimination is used, if the equations are scaled properly and one or two guarding figures are kept in the body of the calculation. The following example illustrates an empirical approach to the problem.

Ex. 4.5. Alter the coefficients in the equations in Ex. 4.2 in the following way. Write the ten numbers −4(1)5 on slips of paper, and for each a_ij and b_i pick one of the ten slips at random. Alter the corresponding coefficient by 10^(−5) times the number on the slip, unless this number is 5, in which case alter the number by 5×10^(−5) if the end-digit is even and −5×10^(−5) if the end-digit is odd. Find the change produced in the roots (4.18).

An experiment of this type gave

10^4 ρ_1 = −5.23,  10^4 ρ_2 = 6.98,  10^4 ρ_3 = −4.60,  10^4 ρ_4 = 4.46,
δx_1 = −0.0852,  δx_2 = 0.0447,  δx_3 = 0.0814,  δx_4 = 0.0398.

The errors in the solution in Ex. 4.2 (see Ex. 4.4) are about one-fifth of these.
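The correction process of (4.22), computing the residuals of an approximate solution and solving for the corrections δx_j, is easily mechanised (it is the ancestor of what is now called iterative refinement). The Python sketch below is illustrative, not the book's program; it uses an ordinary elimination solver and applies one correction step to a deliberately coarse set of roots:

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(a)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for r in range(n):
        p = max(range(r, n), key=lambda i: abs(m[i][r]))
        m[r], m[p] = m[p], m[r]
        for i in range(r + 1, n):
            l = m[i][r] / m[r][r]
            for j in range(r, n + 1):
                m[i][j] -= l * m[r][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def correct(a, b, x):
    """One step of (4.22): form the residuals r_i = sum_j a_ij X_j - b_i,
    solve  a . dx = -r,  and return the improved roots X + dx."""
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    dx = solve(a, [-ri for ri in r])
    return [x[j] + dx[j] for j in range(n)]
```

As in Ex. 4.4, the elimination already performed for the original system does most of the work; only the new right-hand-side column −r_i needs to be processed.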
We have now gone as far as we can, within the scope and limitations of this book, in considering the errors involved in solving simultaneous linear equations. An interesting theoretical treatment from a comparatively elementary point of view is given in Modern Computing Methods, H.M. Stationery Office, 2nd Edn. (1961), Chapter 5.

§4.5. A computer program for the method of successive elimination.† Program 4.1 is a computer program for solving an n×n set of simultaneous linear equations by the method of §4.2. The unknowns are eliminated in the order x_1, x_2, ..., x_n. The pivot at each stage is chosen as the largest element in the first column of the appropriate array. When the statement labelled 12 is reached, this largest element has been determined and the appropriate rows have been interchanged, so that the row containing the largest pivot is now the first row in the array from which the next unknown is to be eliminated. When statements 16 and 17 are completed the appropriate unknowns have been eliminated and the machine proceeds to eliminate the next unknown. The remainder of the program is concerned with back-substitution, computation of the determinant, and printing out the results. (It is advisable to print out some measure of the condition of the equations. The size of the determinant and the smallest pivot give some indication of whether the equations are badly conditioned. In order to save unnecessary printing it may be arranged that these are printed only if they fall below certain limits.) These hints should enable the reader to work through the program in detail, as an exercise. This will be more instructive than pages of further explanation.

† See also Modern Computing Methods, H.M. Stationery Office, 2nd Edition (1961), Chapter 2.
Program 4.1

Label   Statement
        Read n, a_ij for i = 1(1)n and j = 1(1)n, b_i for i = 1(1)n
        Print n, a_ij for i = 1(1)n and j = 1(1)n, b_i for i = 1(1)n
        Do up to 3 for i = 1(1)n
3         a_{i,n+1} = b_i
        Do up to 17 for r = 1(1)n−1
          q = r
4         p = q
5         If q = n go to 10
          q = q + 1
6         If a_pr < a_qr go to 4
          Go to 5
10        Do up to 12 for j = r(1)n+1
            c_j = a_rj
            a_rj = a_pj
12          a_pj = c_j
16        Do up to 17 for i = r+1(1)n and j = r+1(1)n+1
17          a_ij = a_ij − (a_ir a_rj)/a_rr
        x_n = a_{n,n+1}/a_nn
        Do up to 22 for r = n−1(−1)1
          d = 0
          Do up to 20 for j = r+1(1)n
20          d = d + a_rj x_j
22        x_r = (a_{r,n+1} − d)/a_rr
        D = a_11
        Do up to 25 for i = 2(1)n
25        D = D a_ii
        Print x_j for j = 1(1)n, a_ii for i = 1(1)n, D.
        Stop

Examples IV

(Examples for practice in numerical work can be derived from those in the text by rescaling the equations in Ex. 4.2, or by working the examples to a different number of decimal places. Some of the Examples V are relevant to this chapter.)

Ex. 4.6. Modify Program 4.1 so as to solve m sets of equations with the same set of coefficients a_ij but m different right-hand sides b_ik (k = 1 to m).

Ex. 4.7. Draw a flow-diagram for Program 4.1.

Ex. 4.8. Show that the number of multiplications involved in solving n simultaneous linear equations in n unknowns by the method of successive elimination (§4.2) is of the order of (1/3)n³. (This is an important result because the main labour in solving large sets of equations lies in the multiplications, so that the labour increases rapidly with the size of the set of equations. The same result holds for the compact elimination method of §5.2 below.)

Ex. 4.9. Consider the set of equations

b_1 x_1 + c_1 x_2 = d_1,
a_j x_{j−1} + b_j x_j + c_j x_{j+1} = d_j,   (j = 2, 3, ..., n−1),
a_n x_{n−1} + b_n x_n = d_n.

Show that the x_j can be determined from the formulae

g_1 = c_1/b_1,   h_1 = d_1/b_1,
g_j = c_j/(b_j − a_j g_{j−1}),   (j = 2, 3, ..., n−1),
h_j = (d_j − a_j h_{j−1})/(b_j − a_j g_{j−1}),   (j = 2, 3, ..., n),
x_n = h_n,   x_j = h_j − g_j x_{j+1},   (j = n−1, n−2, ..., 1).
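The recurrences of Ex. 4.9 (essentially what is now called the Thomas algorithm) can be rendered in Python as below; the function name and argument convention are illustrative assumptions, not the book's notation:

```python
def tridiag_solve(a, b, c, d):
    """Solve the tridiagonal system of Ex. 4.9.  b[j] is the diagonal,
    a[j] the sub-diagonal (a[0] unused), c[j] the super-diagonal
    (c[n-1] unused), d[j] the right-hand side."""
    n = len(b)
    g = [0.0] * n
    h = [0.0] * n
    g[0] = c[0] / b[0]
    h[0] = d[0] / b[0]
    for j in range(1, n):
        denom = b[j] - a[j] * g[j - 1]
        if j < n - 1:
            g[j] = c[j] / denom
        h[j] = (d[j] - a[j] * h[j - 1]) / denom
    x = [0.0] * n
    x[n - 1] = h[n - 1]
    for j in range(n - 2, -1, -1):
        x[j] = h[j] - g[j] * x[j + 1]
    return x
```

The operation count is plainly proportional to n, rather than to n³ as for a general system.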
If a_j < 0, b_j > 0, c_j < 0, and a_j + b_j + c_j ≥ 0, show that −1 ≤ g_j < 0.

(The number of operations involved is proportional to n, compared with a number of operations proportional to n³ for a general set of equations. The method can be justified from first principles, but it is instructive to consider it from the point of view of the matrix decomposition in Ex. 5.11. Equations of the above type occur in connection with the solution of the heat-conduction equation by implicit difference methods, as in Ex. 11.2.)

CHAPTER V

MATRIX METHODS

§5.1. Matrix algebra. Matrices provide a useful method for systematising both the theoretical and the practical aspects of certain computing procedures, particularly in connection with automatic computers. We start Chapters V and VI with short summaries of the necessary theory,† to provide a guide for the beginner who might otherwise be confused by the mass of detail in textbooks on matrices.

An m×n matrix A is simply a rectangular array of mn numbers, arranged in m rows and n columns. It is said to be of order "m by n" or "m×n":

A = [a_ij] = | a_11 a_12 ... a_1n |
             | a_21 a_22 ... a_2n |
             | .................. |
             | a_m1 a_m2 ... a_mn |

The (i, j)th element a_ij represents the element in the ith row and the jth column of the matrix.

Consider operations involving two matrices A = [a_ij] and B = [b_ij]. Equality, addition, and subtraction are meaningful terms if and only if the matrices have the same number of rows and the same number of columns. If this is true then:

A = B if and only if a_ij = b_ij for all i, j.
A ± B = [a_ij ± b_ij].

This last equation means that the sum or difference of two matrices with equal numbers of rows and columns is the matrix such that any element is the sum or difference of the corresponding elements in A and B. To multiply a matrix by a scalar, say k, each term of the matrix is multiplied by k:

kA = [k a_ij].

† Further information can be found in A. C. Aitken, Determinants and Matrices, Oliver and Boyd (1962).
A matrix with only one row is called a row matrix (or a row vector) and a matrix with only one column is called a column matrix (or a column vector). The scalar product of a row matrix and a column matrix, taken in that order, is meaningful if and only if the row and column have the same number of terms, and then it consists of a single number defined as in the following example:

  [x_1 x_2 ... x_n] | y_1 | = x_1 y_1 + x_2 y_2 + ... + x_n y_n.
                    | y_2 |
                    | ... |
                    | y_n |

Two matrices can be multiplied together if and only if the number of columns in the first is equal to the number of rows in the second. Then the element in the ith row and the jth column of the product is the scalar product of the ith row of the first matrix with the jth column of the second. If A is m × n, B is n × p, and AB = C, then the rule determining c_ik, the elements of the product matrix C, is

  c_ik = Σ_(j=1)^(n) a_ij b_jk,

and C is m × p. As an example,

  | a_11 a_12 |  | b_11 b_12 |   | a_11 b_11 + a_12 b_21   a_11 b_12 + a_12 b_22 |
  | a_21 a_22 |  | b_21 b_22 | = | a_21 b_11 + a_22 b_21   a_21 b_12 + a_22 b_22 |
  | a_31 a_32 |                  | a_31 b_11 + a_32 b_21   a_31 b_12 + a_32 b_22 |

The order in which matrices are multiplied together is obviously very important. The existence of the product AB is quite independent of the existence of BA, and even if AB and BA both exist they will in general be unequal, and there is no very useful connection between the two. In the product AB the matrix A is said to premultiply B, and B is said to postmultiply A. Although the commutative law does not hold for multiplication (AB ≠ BA in general), the distributive and associative laws are true, e.g.

  (A + B)C = AC + BC,  A(BC) = (AB)C = ABC.

It is useful to define certain special types of matrix. A matrix with the same number of rows and columns is called a square matrix. Associated with any square matrix A we have the determinant of the matrix:

  det A = | a_11 ... a_1n |
          | ............. |
          | a_n1 ... a_nn |

If A and B are square matrices of the same order then (Aitken, loc. cit., p. 80)

  det AB = det A . det B.
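As a hedged illustration (my own, not from the book), the multiplication rule and the failure of the commutative law can be written out in Python; the function name is an assumption.

```python
def matmul(A, B):
    """Product C = AB with c_ik = sum over j of a_ij b_jk.
    A is m x n and B is n x p, so C is m x p."""
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "columns of A must equal rows of B"
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(p)]
            for i in range(m)]

# A 3 x 2 matrix times a 2 x 2 matrix gives a 3 x 2 matrix, as in the text.
C = matmul([[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10]])

# For square matrices, AB and BA both exist but are in general unequal.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
AB = matmul(P, Q)
BA = matmul(Q, P)
```

Here AB postmultiplies P by Q (swapping the columns of P), while BA premultiplies P by Q (swapping its rows), so the two products differ.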
(5.1)

A diagonal matrix D = [d_ij] is a square matrix with all the d_ij zero except d_11, d_22, ..., d_nn, which are the elements along the principal diagonal, or the diagonal elements. A diagonal matrix all of whose diagonal elements are unity is called a unit matrix, and it is denoted by I. Clearly

  AI = IA = A,  det I = 1.

In matrix algebra I plays the same role as unity in ordinary algebra. The transpose A' of a matrix A is defined by the property that if A is an m × n matrix whose (i, j)th element is a_ij, then A' is an n × m matrix whose (i, j)th element is a_ji, i.e. [a_ij]' = [a_ji]. A symmetrical matrix is a square matrix with a_ij = a_ji, i.e. A' = A.

§ 5.2. A compact elimination method for the solution of linear equations. We describe a compact elimination method for the solution of linear equations which is essentially a telescoped version of a successive elimination method (see § 4.2 and Ex. 5.4). It is convenient to work in terms of matrices. We consider only 3 × 3 matrices when describing the theory, but the extension to the general case is direct.

Lower and upper triangular matrices L, U are defined as matrices with zero elements above and below the principal diagonal, respectively:

  L = | l_11  0    0   |,   U = | u_11 u_12 u_13 |
      | l_21 l_22  0   |        |  0   u_22 u_23 |
      | l_31 l_32 l_33 |        |  0    0   u_33 |

The set of linear simultaneous equations (4.1), for the case n = 3, can be expressed in matrix notation as Ax = b, where

  A = | a_11 a_12 a_13 |,   x = | x_1 |,   b = | b_1 |
      | a_21 a_22 a_23 |        | x_2 |        | b_2 |
      | a_31 a_32 a_33 |        | x_3 |        | b_3 |

We first of all show that A can be expressed as the product of two matrices in the form A = LU, where L is a lower triangular matrix as defined above and U is an upper triangular matrix with units along the principal diagonal (i.e. u_ii = 1). (The diagonal elements of either L or U, but not of both, can be chosen arbitrarily.) We have

  | a_11 a_12 a_13 |   | l_11  0    0   | | 1 u_12 u_13 |
  | a_21 a_22 a_23 | = | l_21 l_22  0   | | 0  1   u_23 |
  | a_31 a_32 a_33 |   | l_31 l_32 l_33 | | 0  0    1   |
    | l_11   l_11 u_12          l_11 u_13                   |
  = | l_21   l_21 u_12 + l_22   l_21 u_13 + l_22 u_23       |
    | l_31   l_31 u_12 + l_32   l_31 u_13 + l_32 u_23 + l_33 |

On equating the individual elements in the first and last matrices we obtain equations which determine the elements of L and U. It is found that the elements of L and U can be determined as follows:

(a) The first column of L:  l_11 = a_11, l_21 = a_21, l_31 = a_31.
(b) The first row of U:  u_12 = a_12/l_11, u_13 = a_13/l_11.
(c) The second column of L:  l_22 = a_22 - l_21 u_12, l_32 = a_32 - l_31 u_12.
(d) The second row of U:  u_23 = (a_23 - l_21 u_13)/l_22.
(e) The third column of L:  l_33 = a_33 - l_31 u_13 - l_32 u_23.

The importance of doing the computations in the above order is that the quantities required to compute the value of an element at any stage of the calculation have already been computed at some previous stage. In hand computation it is of course important to evaluate sums of products in one machine operation, without intermediate recording.

Assuming that L, U are known, we can write Ax = b as

  LUx = b.  (5.2)

We introduce y defined by y = Ux. Then (5.2) gives

  Ly = b,  (5.3a)
  Ux = y.  (5.3b)

Written out in full, equations (5.3a, b) are:

  l_11 y_1 = b_1
  l_21 y_1 + l_22 y_2 = b_2
  l_31 y_1 + l_32 y_2 + l_33 y_3 = b_3
                                          (5.4)
  x_1 + u_12 x_2 + u_13 x_3 = y_1
  x_2 + u_23 x_3 = y_2
  x_3 = y_3

The values of y_1, y_2, y_3 can be computed in succession from the first set of equations, and then x_1, x_2, x_3 can be computed in succession from the second set. If we solve the first set of equations in (5.4) for the y_i we see that

  y_1 = b_1/l_11,
  y_2 = (b_2 - l_21 y_1)/l_22,        (5.5)
  y_3 = (b_3 - l_31 y_1 - l_32 y_2)/l_33.

Before considering the organisation of the calculation we show that numerical checks can be obtained in the following way.
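As a side illustration, steps (a)-(e) and the substitutions (5.4)-(5.5) can be sketched in Python for the general n × n case. This is an unpivoted sketch of my own, not the book's layout; it assumes the diagonal elements l_rr do not vanish, and the function names are assumptions.

```python
def lu_decompose(A):
    """A = LU with units on the diagonal of U (u_ii = 1), computed a
    column of L and then a row of U alternately, as in steps (a)-(e)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[float(i == j) for j in range(n)] for i in range(n)]
    for r in range(n):
        for i in range(r, n):        # column r of L
            L[i][r] = A[i][r] - sum(L[i][k] * U[k][r] for k in range(r))
        for j in range(r + 1, n):    # row r of U: same rule, divided by l_rr
            U[r][j] = (A[r][j] - sum(L[r][k] * U[k][j] for k in range(r))) / L[r][r]
    return L, U

def lu_solve(L, U, b):
    """Forward substitution Ly = b as in (5.5), then back-substitution Ux = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))
    return x
```

Each entry is formed from previously computed entries only, which is exactly the property that makes the hand layout "compact".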
We first of all define the row sums

  s_1 = a_11 + a_12 + a_13 + b_1,
  s_2 = a_21 + a_22 + a_23 + b_2,
  s_3 = a_31 + a_32 + a_33 + b_3.

A little reflection shows that if the b_i on the right of the first set of equations in (5.4) are replaced by s_i, then the y_i on the right of the second set are replaced by the quantities

  Σ_1 = 1 + u_12 + u_13 + y_1,
  Σ_2 = 1 + u_23 + y_2,        (5.6a)
  Σ_3 = 1 + y_3.

The same quantities are obtained if we follow the procedure used to derive the y_i:

  σ_1 = s_1/l_11,
  σ_2 = (s_2 - l_21 σ_1)/l_22,        (5.6b)
  σ_3 = (s_3 - l_31 σ_1 - l_32 σ_2)/l_33.

Algebraically the σ_i and Σ_i represent identical quantities, but in numerical work they are computed in different ways, so that the numerical values may differ slightly in the last significant figure due to round-off error. The approximate equality of the σ_i and Σ_i serves as an independent check on the calculation.

We now describe a practical procedure for carrying out the above method.† The symbols have been defined above except for the column check-sums (cf. § 4.2):

  S_j = a_1j + a_2j + a_3j  (j = 1, 2, 3),
  S_4 = b_1 + b_2 + b_3,
  S = S_1 + S_2 + S_3 + S_4,
  s = s_1 + s_2 + s_3.

It is convenient to divide the calculation into three stages.

† P. D. Crout, Trans. Amer. Inst. Electr. Engrs., 60 (1941), 1235-40. Another layout is given in Modern Computing Methods, H.M. Stationery Office, 2nd Edition (1961), Chapter 1. Which of these layouts is recommended depends on personal preference.

(i) Write down the original matrix with check sums:

  a_11  a_12  a_13  b_1  (s_1)
  a_21  a_22  a_23  b_2  (s_2)
  a_31  a_32  a_33  b_3  (s_3)
  (S_1) (S_2) (S_3) (S_4)  (S = s).

(ii) Write down an auxiliary matrix. The method of deriving this matrix is given below. It may be noted at this point that the u_ij, y_i, and σ_i are all determined by following exactly the same rule (cf. (b), (d), (5.5), and (5.6b) above, and the rule given in the paragraph preceding Ex. 5.1 below).

  l_11  u_12  u_13  y_1  (σ_1 ≈ Σ_1)
  l_21  l_22  u_23  y_2  (σ_2 ≈ Σ_2)
  l_31  l_32  l_33  y_3  (σ_3 ≈ Σ_3).
(iii) Obtain the unknowns by back-substitution in the second set of equations in (5.4) and apply the final check:

  S_1 x_1 + S_2 x_2 + S_3 x_3 ≈ S_4.

The elements of the auxiliary matrix are written down in the following order:

(a) Elements of the first column, l_i1.
(b) Elements of the first row to the right of the diagonal, u_1j, y_1, σ_1. Then apply a check: Σ_1 = 1 + u_12 + u_13 + y_1 ≈ σ_1.
(c) Elements of the second column on and below the diagonal, l_i2.
(d) Elements of the second row to the right of the diagonal, u_2j, y_2, σ_2. (Notice that Σ_1, not σ_1, is used to compute σ_2 (cf. Ex. 4.2).) Then apply the check: Σ_2 = 1 + u_23 + y_2 ≈ σ_2.
(e) Elements of the third column on and below the diagonal, l_33.
(f) Elements of the third row to the right of the diagonal, y_3, σ_3. (Note that Σ_1, Σ_2, not σ_1, σ_2, are used to compute σ_3.) Then apply the check: Σ_3 = 1 + y_3 ≈ σ_3.

The elements of the auxiliary matrix are determined by the following general rule: each element on or below the diagonal is equal to the corresponding element in the original matrix minus the sum of products of elements in its row with corresponding elements in its column, in the auxiliary matrix, where these products involve only previously computed elements. Each element to the right of the diagonal is obtained by exactly the same rule, except that finally we divide by the diagonal term in the same row. This last part of the rule applies to the u_ij, y_i, and σ_i, but it does not of course apply to the checks Σ_i, which are defined as in (5.6a).

If the matrix A is symmetrical, labour can be saved by noting that each element u_ij of U is equal to the symmetrically situated element l_ji of L divided by the diagonal element l_ii.

Ex. 5.1. Apply the above method to the equations:

  x_1 + 4x_2 +  x_3 +  3x_4 =   1,
      -  x_2 + 3x_3 +   x_4 =  -4,        (5.7)
 3x_1 +  x_2 + 6x_3 - 10x_4 = -11,
  x_1 - 2x_2 + 5x_3         =   1.
Show that the auxiliary matrix, including check columns, is

  1    4    1    3    1   (10)
  0   -1   -3   -1    4    (1)
  3  -11  -30    1   -1    (1)
  1   -6  -14    5    2    (3),

and deduce the solution x_1 = 10, x_2 = -3, x_3 = -3, x_4 = 2.

Ex. 5.2. Solve the equations of Ex. 4.2 by the above method. The auxiliary matrix is (omitting the check column)

  0.5400   0.96907   0.80704   0.67074   2.41019
  0.5233   0.28009  -0.21466   0.62373  -3.64080
  0.4358  -0.06012   0.02349   1.80886   0.05765
  0.3622   0.17470   0.04249   0.00893  -6.35751

Roots: x_1 = -5.3717, x_2 = 2.8055, x_3 = 11.5575, x_4 = -6.3575.
Errors: δx_1 = -0.0151, δx_2 = 0.0082, δx_3 = 0.0154, δx_4 = -0.0078.

It is seen that the errors are almost exactly the same as those obtained by the method of successive elimination in § 4.2 (see Table 4.1). Theoretically we should expect the compact method to be slightly more accurate than the method of successive elimination, since there are fewer round-off errors.

Ex. 5.3. Show that the compact method gives the following results for the scaled equations in Ex. 4.3:

Roots: x_1 = -5.3715, x_2 = 2.8054, x_3 = 11.5575, x_4 = -6.3574.
Errors: δx_1 = -0.0153, δx_2 = 0.0083, δx_3 = 0.0154, δx_4 = -0.0079.

Ex. 5.4. Derive the compact method corresponding to the method of successive elimination of § 4.2, in the following way, from equations (4.4)-(4.10). We have e_4 = d_4 - l_43 d_3. In this expression substitute for d_4 in terms of c_4, and then for c_4 in terms of b_4, to find

  e_4 = b_4 - l_41 b_1 - l_42 c_2 - l_43 d_3.

Similarly show that all the terms in (4.5), (4.8), (4.10) can be expressed in terms of quantities which occur in these equations, the original matrix, and the l_ij. Show that a similar result holds for the check sums. Hence it is unnecessary to record the quantities in the square brackets in (4.6), (4.7), (4.9).
Show that the resulting formulae are in fact exactly those involved in solving Ax = b by decomposing A in the form A = LU, where L has units along the principal diagonal. In the compact method described above we have arranged that it is U which has units along the diagonal instead of L. Show that this is a condensed form of the method of successive elimination in which, for instance, we divide the row of quantities a_1j, b_1, s_1 in (4.4) by a_11, instead of dividing the column a_i1 by a_11, and similarly for the other arrays.

There are various methods of choosing the diagonal elements in L and U, and of arranging the methods of successive elimination and compact elimination. The precise variant which is chosen is largely a matter of personal preference.

§ 5.3. The inverse matrix. Suppose that, given an n × n matrix A, there exists a matrix Z such that AZ = I. Then it is possible to show that ZA = I. The matrix Z is called the inverse of A and it is usually denoted by A^-1. When A^-1 exists we have

  A^-1 A = A A^-1 = I.

From this relation, using (5.1), we have

  det A . det A^-1 = 1.  (5.8)

Hence a necessary condition for the existence of the inverse of A is that det A ≠ 0. If this is true then the matrix A is said to be non-singular, but if det A = 0 the matrix A is said to be singular, and in this case the inverse does not exist.

Instead of dividing by matrices we multiply by the inverse matrix. As an example consider the simultaneous linear equations (4.1), which in matrix notation are

  Ax = b.  (5.9)

We assume that the determinant of the coefficients is non-zero, so that a unique solution of the equations exists, and the inverse matrix also exists. To solve (5.9) we wish to remove A from the left-hand side. Premultiply both sides by A^-1:

  A^-1 A x = A^-1 b.

Since A^-1 A = I, this gives

  x = A^-1 b.  (5.10)

One practical procedure for the determination of the inverse matrix is obvious from its definition.
Denote the jth column of A^-1 by z_j, and let d_j denote the n × 1 column matrix whose elements are zero except for the jth element, which is unity. Then AA^-1 = I gives

  A z_j = d_j,  (j = 1, 2, ..., n).

These are just n sets of simultaneous linear equations with n different right-hand sides of simple form, but the same coefficient matrix on the left. They can be solved by extending any method for solution of a single set of equations. As an example consider the extension of the compact method of § 5.2. The "original matrix" will appear as follows:

  a_11  a_12  a_13  1  0  0  (s_1)
  a_21  a_22  a_23  0  1  0  (s_2)        (5.11)
  a_31  a_32  a_33  0  0  1  (s_3)
  (S_1) (S_2) (S_3) (1) (1) (1)  (S = s).

There is still only one check column, obtained by adding all the elements in the corresponding rows. The "auxiliary matrix" will be of the form

  l_11  u_12  u_13  m_11              (σ_1 ≈ Σ_1)
  l_21  l_22  u_23  m_21  m_22        (σ_2 ≈ Σ_2)
  l_31  l_32  l_33  m_31  m_32  m_33  (σ_3 ≈ Σ_3).

In the position occupied by the unit matrix in the original matrix there is a lower triangular matrix with elements m_ij. The calculation can be completed by forming a back-substitution matrix of the following form:

  1  u_12  u_13  A_11  A_12  A_13
  0   1    u_23  A_21  A_22  A_23
  0   0     1    A_31  A_32  A_33.

The A_ij are the required elements of A^-1. Checks are given by

  S_1 A_1j + S_2 A_2j + S_3 A_3j = 1,  (j = 1, 2, 3).

The square matrix on the left of the back-substitution matrix is simply the upper triangular matrix U. The elements A_ij are obtained in the order: last row from left to right, second last row from left to right, and so on. The last row is the same as the corresponding part of the last row in the auxiliary matrix. The second last row is given by the formula

  A_2j = m_2j - u_23 A_3j,  (j = 1, 2, 3),

where of course m_23 is taken to be zero.
The general rule for the element A_ij is: any element is given by the corresponding element in the auxiliary matrix, less the sum of products of previously computed elements in its column and corresponding elements in its row in the U matrix. The ith element in its column is taken with the ith element in the row of the U matrix. The method extends directly to the general n × n case.

It is seen that the amount of labour involved in finding an inverse matrix is about three times the work involved in solving a single set of equations, independent of the value of n.

Ex. 5.5. Find A^-1 for the matrix A of the equations in Ex. 4.2. Part of the auxiliary matrix has already been given in Ex. 5.2.

  A^-1 = |  399.8  -213.3  -401.7   198.2 |
         | -213.3   120.2   214.1  -113.3 |
         | -401.7   214.1   409.0  -202.6 |
         |  198.2  -113.3  -202.6   112.0 |

Ex. 5.6. Solve the equations in Ex. 4.2 by the formula x = A^-1 b, using the inverse matrix found in Ex. 5.5.

Roots: x_1 = -5.3822, x_2 = 2.8157, x_3 = 11.5627, x_4 = -6.3607.
Errors: δx_1 = -0.0046, δx_2 = -0.0020, δx_3 = 0.0102, δx_4 = -0.0046.

Note that in this example the individual terms of A^-1 are large compared with the final roots. This indicates that the equations are ill-conditioned, since in the sum of products required to form the roots large terms cancel to give small end-results. On the other hand the degree of ill-conditioning is not excessive.

We make the following comments in connection with the discussion of pivots and scaling in § 4.3. The elements of A^-1 in Ex. 5.5 are comparable with each other in size. Since we are using fixed-point working and the elements of A are comparable, this means that the errors of the elements of A^-1 should be comparable.
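The column-by-column idea behind (5.11), solving Az_j = d_j for each column d_j of the unit matrix, can be sketched as follows. This minimal version is my own (not the book's layout): it eliminates on the doubled-up array [A | I] with partial pivoting rather than using the compact tableau.

```python
def invert(A):
    """Return A^-1 by elimination on [A | I], i.e. by solving
    A z_j = d_j simultaneously for all n unit right-hand sides."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for r in range(n):
        p = max(range(r, n), key=lambda i: abs(M[i][r]))   # pivotal row
        M[r], M[p] = M[p], M[r]
        piv = M[r][r]
        M[r] = [v / piv for v in M[r]]                     # normalise pivot row
        for i in range(n):
            if i != r:
                f = M[i][r]
                M[i] = [v - f * w for v, w in zip(M[i], M[r])]
    return [row[n:] for row in M]                          # right-hand half is A^-1
```

Applied to the matrix of equations (5.7), this reproduces, up to round-off, the exact inverse with denominator 150 quoted in Ex. 5.16 below.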
If there is considerable variation in the sizes of the largest elements in the rows (or columns) of an inverse, we have to be careful that we are not losing accuracy unnecessarily, in the sense that the accuracy of the inverse could be improved by scaling the original matrix.

In the method described above for inverting a matrix we can obtain various choices of pivots by rearranging the rows and columns of the original matrix. Whatever choice of pivots is used, the reciprocal of the last pivot is always an element of the inverse matrix. If the last pivot is small then at least one element of the inverse matrix is large, and the equations are to this extent ill-conditioned. This is one way of seeing that the size of the last pivot has a second-order effect on accuracy.

The condition of the equations Ax = b depends on the matrix A and not on b. The equations are ill-conditioned if, when we check that A^-1 A is equal to I, terms large compared with unity have cancelled to give (approximately) units along the principal diagonal and zeros elsewhere.

If B is a matrix such that BA ≈ I then we say that B is an approximate left-inverse of A, and similarly if AC ≈ I the matrix C is an approximate right-inverse of A. Due to ill-conditioning, even if AC ≈ I the product CA may not be a good approximation to I. As an example consider

  A = | 1.00  1.00 |,   C = | -89   100 |.
      | 1.00  0.99 |        |  90  -100 |

Then

  AC = | 1.0  0.0 |,   CA = |  11   10 |.
       | 0.1  1.0 |         | -10   -9 |

A good right-inverse may not be a good left-inverse, and conversely. The situation in numerical work is quite different from the situation in the algebraic theory, where the left- and right-inverses are the same.

We conclude this chapter by expressing in terms of matrices the discussion at the end of § 4.4 concerning accuracy when solving linear equations. We define a column matrix of residuals, r, by the equation

  r = AX - b,

where X is an approximate set of roots.
If the exact roots x are related to the approximate roots by the equation x = X + δx, where δx is a column matrix of errors, we readily find

  A δx = -r,  i.e.  δx = -A^-1 r.  (5.12)

This is the matrix form of (4.22). If errors in the coefficients are given by δA = [δa_ij], δb = [δb_i], and the corresponding new roots are denoted by x + δx, we have

  (A + δA)(x + δx) = b + δb.

On neglecting the second-order term δA δx we find

  δx = A^-1 (δb - δA x).  (5.13)

This is the matrix form of (4.23).

When using (5.12) it is essential to evaluate the residuals r to the number of significant figures required in δx, and since the residuals are small this may mean using a large number of decimal places when calculating the residuals. On the other hand A^-1 need only be known to the number of significant figures required in δx, which may mean that only an approximate determination of A^-1 is required. If an approximate value of A^-1 is known, then the roots of Ax = b can be evaluated as accurately as required by the iteration

  x_0 = (A^-1)_approx b,
  x_(s+1) = x_s - (A^-1)_approx r_s,  (s = 0, 1, 2, ...),  (5.14)

where x_s is the sth approximation to x. The residuals must be evaluated with the appropriate accuracy at each stage. This is an illustration of the principle that in an iterative procedure it is often possible to use approximations in certain parts of the computation, provided that the requisite accuracy is maintained in certain vital parts of the calculation.

Ex. 5.7. Check the results of Exs. 4.4, 4.5 by using the inverse matrix in Ex. 5.5 in conjunction with (5.12), (5.13).

When working with simultaneous linear equations it often happens, as in Ex. 5.7 compared with Exs. 4.4, 4.5, that we can either evaluate z = A^-1 d if A^-1 is available, or we can solve Az = d, where the major part of the work has already been done since another set of equations Ax = b has already been solved. To some extent the choice is a matter of personal preference.
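The iteration (5.14) can be illustrated in Python. This sketch is my own, not the book's: as (A^-1)_approx it takes the exact inverse of the matrix of (5.7) (given in Ex. 5.16 below) deliberately rounded to three decimals, while the residuals are computed in full precision, which is the "vital part" of the calculation.

```python
def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

A = [[1, 4, 1, 3], [0, -1, 3, 1], [3, 1, 6, -10], [1, -2, 5, 0]]
b = [1, -4, -11, 1]

# Crude approximate inverse: Ex. 5.16's exact inverse rounded to 3 decimals.
exact_inv = [[27, -271, -19, 180], [21, 67, 13, -60],
             [3, 81, 9, -30], [12, -26, -14, 30]]
B = [[round(v / 150, 3) for v in row] for row in exact_inv]

x = matvec(B, b)                                         # x_0 = (A^-1)_approx b
for _ in range(6):
    r = [ai - bi for ai, bi in zip(matvec(A, x), b)]     # residuals r_s = A x_s - b
    dx = matvec(B, r)
    x = [xi - di for xi, di in zip(x, dx)]               # x_(s+1) = x_s - B r_s
```

The iterates converge rapidly to the exact roots (10, -3, -3, 2) of Ex. 5.1 even though B is only a rough inverse.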
In hand computing it may be preferable to work with A^-1, since multiplication by A^-1 is rather more straightforward than adding an extra column to an auxiliary matrix and back-substituting. Also the size of the terms in A^-1 gives some indication of the condition of the matrix. When using a computer it may be preferable to solve an extra set of equations, but this of course depends on what is stored inside the machine.

Examples V

Ex. 5.8. Write a computer program (or a flow-diagram) for the solution of linear equations by the compact method of § 5.2.

Ex. 5.9. Write a computer program (or a flow-diagram) for the inversion of a matrix using (i) the compact method explained in connection with (5.11), (ii) the successive elimination method of § 4.2 (compare Program 4.1).

Ex. 5.10. A compact method which is sometimes useful when it is desired to avoid small pivots is the square-root method, in which we choose l_ii = u_ii in the decomposition A = LU. Work out the details and show that when A is symmetrical, U is equal to the transpose of L. Apply the method to the example in Ex. 4.2 and show that the errors are about two-thirds of those by the method of Ex. 4.2 (Table 4.1). (The labour involved in taking square roots is justified only if otherwise the pivots become undesirably small in the sense discussed in § 4.3.)

Ex. 5.11. Show that the band or tridiagonal matrix A defined by a_ij = 0 for |i - j| > 1 can be decomposed in the form LU, where L is a lower triangular matrix with l_ij = 0 for i - j > 1, and U is an upper triangular matrix with u_ii = 1 and u_ij = 0 for j - i > 1. Find L and U for

  A = |  2 -1  0  0 |
      | -1  2 -1  0 |
      |  0 -1  2 -1 |
      |  0  0 -1  2 |

Consider the procedure in Ex. 4.9 from the point of view of the compact elimination method of § 5.2.

Ex. 5.12. Find the exact inverse of the matrix A given at the end of Ex.
5.11, using the compact method explained in connection with (5.11), expressing all non-integral numbers as fractions instead of using decimals.

Ex. 5.13. Show that (AB)' = B'A', and (AB)^-1 = B^-1 A^-1 if both inverses exist.

Ex. 5.14. Evaluate det A, the determinant of the coefficients a_ij, for the example in Ex. 4.2, to four significant figures. (The pivots in Ex. 4.2, where five decimals were carried in the intermediate calculations, give det A ≈ 0.3173 × 10^-4. A similar calculation carrying six decimals gives det A ≈ 0.3167 × 10^-4. Keeping one extra decimal place does not mean that the error is reduced by exactly one decimal place, since round-off errors occur at random. But the results indicate that the last result is correct to one digit in the fourth significant figure. How could the accuracy be checked if you were using a digital computer?)

Ex. 5.15. Suppose that we are given m pairs of points (x_s, y_s) and it is required to represent the numbers y_s approximately by a relation of the form y_s ≈ f(x_s), where

  f(x) = z_1 f_1(x) + z_2 f_2(x) + ... + z_n f_n(x),  (n < m),

where the f_i(x) are known functions and the z_i are constants which are to be determined. This procedure is known as curve-fitting. We set

  e_s = Σ_(i=1)^(n) z_i f_i(x_s) - y_s,  (s = 1 to m),

and determine the z_i by the principle of least squares. We minimise Σ e_s². Partial differentiation with respect to z_j (j = 1 to n) gives n equations in n unknowns which can be written, in matrix form,

  A'A z = A'y,

where A = [a_sj] = [f_j(x_s)], z = [z_i], y = [y_s], and A is m × n, z is n × 1, y is m × 1. These equations can be badly conditioned, especially if the points (x_s, y_s) tend to group in clusters. In such cases it may be advisable to replace a cluster of say p points by a single point at the centre of gravity of the cluster, repeated p times. Fit a quadratic to the following points:

  x_s | 0.52  1.22  2.06  3.47  4.00  4.60
  y_s | 2.90  1.72  1.20  2.00  2.61  4.05

Ex. 5.16.
Show that the inverse of the matrix of coefficients of the equations (5.7) is

  A^-1 = (1/150) |  27  -271  -19  180 |
                 |  21    67   13  -60 |
                 |   3    81    9  -30 |
                 |  12   -26  -14   30 |

Ex. 5.17. The effect on the solution of a set of simultaneous linear equations Ax = b of modifying one of the coefficients a_ij can be analysed in the following way. Suppose that we know A^-1 = [A_ij] and we wish to solve (A + D)x = b, where D is an n × n matrix with typical element d_ij such that d_ij is zero unless i = r, j = s, i.e. d_rs is the only non-zero element of D. Then

  (I + A^-1 D)x = A^-1 b.

If x* is the solution of Ax* = b, show that

  x_s = x*_s / (1 + A_sr d_rs),
  x_j = x*_j - A_jr d_rs x_s,  (j ≠ s).

Show that the solution of (5.7) with the second equation replaced by -x_2 + 2x_3 + x_4 = -4 is given by

  23x_1 = 501,  23x_2 = -136,  23x_3 = -150,  23x_4 = 72.

CHAPTER VI

EIGENVALUES AND EIGENVECTORS

§ 6.1. Introduction. Many of the applications of matrices involve eigenvalues and eigenvectors. To introduce the ideas which are involved we discuss a simple physical example. Further applications occur, for instance, in connection with the principal axes of a quadric, and the principal moments of inertia of a solid body.† The reader who does not find physical examples illuminating can start with the mathematical definition (6.3) below.

Consider three particles, each of mass m, at equal distances l along an elastic string of length 4l, where the string is fixed at both ends. If the particles execute small transverse vibrations under no external forces, the equations of motion are

  m d²X_1/dt² = -T X_1/l + T(X_2 - X_1)/l,
  m d²X_2/dt² = -T(X_2 - X_1)/l + T(X_3 - X_2)/l,
  m d²X_3/dt² = -T(X_3 - X_2)/l - T X_3/l,

† See W. H. McCrea, Analytical Geometry of Three Dimensions, Oliver and Boyd (1960), and D. E. Rutherford, Classical Mechanics, Oliver and Boyd (1960). For mathematical properties of eigenvalues and eigenvectors (latent roots and vectors) see A. C. Aitken, Determinants and Matrices, Oliver and Boyd (1962).
where X_1, X_2, X_3 are the displacements of the three particles and T is the tension in the string. If all quantities vary sinusoidally with time we can set

  X_r = x_r e^(iωt),  (r = 1, 2, 3).

Then the above equations become

  (2 - λ)x_1 - x_2 = 0,
  -x_1 + (2 - λ)x_2 - x_3 = 0,        (6.1)
  -x_2 + (2 - λ)x_3 = 0,

where λ = ω²ml/T. This is a homogeneous set of simultaneous linear algebraic equations which in general possesses only the trivial solution x_1 = x_2 = x_3 = 0. However, non-zero solutions will exist for special values of λ. From Cramer's rule (4.2) for the solution of linear equations in terms of the ratios of determinants, we deduce that a necessary condition for the existence of a non-zero solution of (6.1) is that the determinant of the coefficients is zero:

  | 2-λ   -1    0  |
  | -1   2-λ   -1  | = 0.        (6.2)
  |  0   -1   2-λ  |

By expanding this determinant we obtain a cubic equation in λ which can be solved to give the three special values of λ for which non-zero solutions of (6.1) exist, namely

  λ = 2 - √2,  2,  2 + √2.

If in (6.1) we set λ = 2 - √2 then we find that the resulting equations have a solution x_1 = C, x_2 = C√2, x_3 = C, where C is any arbitrary constant. The physical meaning of this is that corresponding to λ = 2 - √2 there is a free vibration of angular frequency ω given by ω² = T(2 - √2)/ml, and corresponding to this frequency of vibration there is a mode of oscillation such that the ratios x_1 : x_2 : x_3 are given by 1 : √2 : 1. Similarly for λ = 2, x_1 : x_2 : x_3 = 1 : 0 : -1; for λ = 2 + √2, x_1 : x_2 : x_3 = 1 : -√2 : 1.

This analysis shows that there are three and only three frequencies of free vibration. Corresponding to each of these frequencies there exists a solution of equations (6.1) which gives the corresponding mode of oscillation. The modes for this particular case are shown graphically in Fig. 6.1. The reader should visualise these modes physically.
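A quick numerical check of (6.2) can be made by evaluating the determinant at the three stated values of λ and confirming that it vanishes. This is an illustration of my own, not part of the text; the function names are assumptions.

```python
import math

def det3(M):
    """Expansion of a 3 x 3 determinant along its first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def char_det(lam):
    """det(A - lam I) for the matrix of coefficients of equations (6.1)."""
    A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
    return det3([[A[i][j] - (lam if i == j else 0) for j in range(3)]
                 for i in range(3)])

roots = [2 - math.sqrt(2), 2, 2 + math.sqrt(2)]
```

At each of the three roots the determinant is zero to within round-off, while at any other value of λ it is not.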
We now express these ideas in more precise mathematical terminology. If A is an n × n matrix then, generalising (6.1), we consider the equations

  (A - λI)x = 0,  or  Ax = λx.  (6.3)

From Cramer's rule (4.2) these equations possess only the trivial solution x = 0 unless the determinant of the coefficients is zero:

  det(A - λI) = | a_11-λ   a_12   ...   a_1n  |
                |  a_21   a_22-λ  ...   a_2n  | = 0.        (6.4)
                |  ...........................  |
                |  a_n1    a_n2   ...  a_nn-λ |

This is a polynomial equation in λ. The roots of this algebraic equation are special values of λ for which the simultaneous equations (6.3) possess non-zero solutions. They are called the eigenvalues of the matrix A, and will be denoted by λ_i (i = 1 to n). The λ_i are sometimes called latent roots, characteristic roots, or proper values. Corresponding to each of the λ_i there will be a solution of (6.3) of the form Cx_i, where x_i is a non-zero vector and C is an arbitrary constant. These solutions are called the eigenvectors (or latent vectors, characteristic vectors, proper vectors). We have

  A x_i = λ_i x_i.  (6.5)

[Fig. 6.1. Modes of vibration of three particles on a string: λ_1 = 2-√2, λ_2 = 2, λ_3 = 2+√2.]

We now deduce some properties of the eigenvalues and eigenvectors for the special case where the a_ij are real and A is symmetrical. We first of all show that then the eigenvalues and eigenvectors are real. In the general case, since the λ_i are roots of an algebraic equation, the eigenvalues may be complex, and therefore the elements of x_i
Strictly speaking y'z is a 1 x 1 matrix and the inner product should be denoted by some special symbol such as x.y or (x, y) but no confusion will arise from our usage.) On writing x',x ; and x'lAx-, in full it is seen, on using the assumption that A is symmetrical, that both of these expressions are real. Hence the X, are real. Corresponding to each of the /.,, equations (6.3) will possess a solution of the form Cx, where the elements of x, are real. From now on we can therefore drop the bars. If we multiply (6.5) by x}, 0' ^ /), write down the corresponding expression with / and j interchanged, and subtract the two expressions, we obtain, since Xj<4x, =x\Ax l and x'jx, = x,x„ (;.,-;», = 0, (i*j). We shall assume that the eigenvalues are distinct, i.e. A, / /,. Then x.'x ( = 0, (i*j). (6.7) If this equation is satisfied the vectors x ( and x ; are said to be orthogonal and the eigenvectors x, form an orthogonal set of vectors. Since the elements of an eigenvector are derived from homogeneous equations these elements can be multi- plied by an arbitrary constant. It is usually convenient to 128 NUMERICAL METHODS §6.1 adjust the size of the elements by some standard rule. This procedure is called normalisation and it can be done in various ways. We shall use two kinds of normalisation in this chapter: (i) The eigenvectors can be normalised so that the numerically largest element of each eigenvector is unity, (ii) An eigenvector x, with elements d,, d 2 , ... d„ can be normalised so that x\ Xl = d] + d\ + ... + dl=\. In this case we say that the eigenvectors are normalised to have unit length. If a set of vectors is such that it is impossible to find constants a,, not all of which are zero, such that a,x,+a 2 x 2 + ...+a fl x ll = 0, (6.8) then the vectors are said to be linearly independent. A useful deduction from this definition is that if a relation of the form (6.8) holds between linearly independent vectors then a, = a, = ... = a„ = 0. 
Orthogonal vectors are linearly independent for, if a relation of the form (6.8) exists, we see on multiplying by x_i' that a_i = 0, and this holds for all i.

An arbitrary n x 1 vector u can be expressed uniquely as a sum of multiples of n orthogonal n x 1 vectors:

    u = a_1 x_1 + a_2 x_2 + ... + a_n x_n,    (6.9)

where the a_i are suitable constants. For if the a_i are regarded as unknowns then (6.9) is an n x n set of simultaneous linear equations for the a_i. The determinant of the coefficients is non-zero, since otherwise the sum of certain multiples of the columns of the determinant would be zero, i.e. (6.8) would be true, but we have just shown that this is impossible. Hence values for the a_i exist and are unique. In fact we can easily write down the solution explicitly, since multiplication of (6.9) by x_i' gives a_i = x_i'u / x_i'x_i.

Ex. 6.1. From the results for the problem discussed at the beginning of § 6.1 deduce that the matrix

        [  2  -1   0 ]
    A = [ -1   2  -1 ]    (6.10)
        [  0  -1   2 ]

has eigenvalues λ_1 = 2 - √2, λ_2 = 2, λ_3 = 2 + √2, with corresponding eigenvectors, normalised to unit length,

    x_1 = [1/2, √2/2, 1/2]',  x_2 = [√2/2, 0, -√2/2]',  x_3 = [1/2, -√2/2, 1/2]'.

It is readily verified that x_i'x_j = 0, (i ≠ j).

§ 6.2. An iterative method for finding the largest eigenvalue. In the last section we obtained the eigenvalues for a particular example by direct expansion of the determinant (6.2). This will not be a very efficient method for determinants of large order. In this section we discuss an iterative method for finding the largest eigenvalue of a general n x n matrix. We assume that the matrix is symmetrical and real, since this is the case covered by the theory in § 6.1. The eigenvalues are then real, and we shall assume that there is one eigenvalue of maximum modulus. An extension to the case where there are two eigenvalues of the same modulus, but opposite signs, is given in Ex. 6.7.
The method given in this section can be extended to other cases (but these will not be examined in detail):

(a) If the matrix is real but unsymmetrical, it is necessary to introduce biorthogonal vectors as in Ex. 6.9. The method given in this section applies if there is one eigenvalue of largest modulus. In this case the eigenvalue of largest modulus will automatically be real, since, if the matrix is real, complex eigenvalues always occur in conjugate pairs. A method for dealing with complex conjugate dominant eigenvalues is given in Ex. 6.8.

(b) If the matrix has complex elements, it is relatively easy to extend the theory in § 6.1 to the case of a Hermitian matrix, where the elements are complex but a_ji = ā_ij. The eigenvalues are real, and the method given in this section applies if there is one eigenvalue of maximum modulus.

(c) For the general complex matrix the method given in this section applies, provided there is one eigenvalue of maximum modulus, though, as in (a), it is necessary to introduce biorthogonal vectors in the theory.

Suppose that u_0 is an arbitrary n x 1 column matrix. By successive premultiplication by A form the sequence

    u_1 = A u_0,  u_2 = A u_1 = A² u_0,  ...,  u_p = A u_{p-1} = A^p u_0.    (6.11)

(The suffix p in u_p is used in a different sense from the suffix i in x_i. This distinction should be borne in mind throughout this section.) On assuming a representation of the form (6.9) for u_0 and using (6.5) we see that

    u_1 = A u_0 = a_1 λ_1 x_1 + a_2 λ_2 x_2 + ... + a_n λ_n x_n,
    u_p = A^p u_0 = a_1 λ_1^p x_1 + a_2 λ_2^p x_2 + ... + a_n λ_n^p x_n    (6.12)
        = λ_1^p { a_1 x_1 + Σ_{i=2}^n a_i (λ_i/λ_1)^p x_i }.    (6.13)

Suppose next that |λ_1| > |λ_2| ≥ |λ_3| ≥ ... ≥ |λ_n|. Then (λ_i/λ_1)^p tends to zero as p tends to infinity (i > 1). If u_p, x_i now denote the sth elements of the vectors u_p, x_i for some specific s, and we define

    l_p = u_p / u_{p-1},    (6.14)

we find

    l_p = λ_1 [1 + Σ_{i=2}^n B_i (λ_i/λ_1)^p] / [1 + Σ_{i=2}^n B_i (λ_i/λ_1)^{p-1}] → λ_1  as p → ∞,    (6.15)

where B_i = a_i x_i / (a_1 x_1).
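The iteration (6.11)-(6.15) can be sketched in a short program. The code below (Python, a hypothetical modern rendering; the book itself predates such languages) applies it to the matrix (6.10) of Ex. 6.1, whose largest eigenvalue is 2 + √2. Each u_p is rescaled by its largest element so that the powers of A do not overflow; this does not affect the ratios l_p.

```python
import math

# Matrix (6.10) of Ex. 6.1; dominant eigenvalue is 2 + sqrt(2).
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

def matvec(M, v):
    n = len(M)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

u = [0.0, 1.0, 0.0]    # trial vector u_0 (its coefficient a_1 must not vanish)
lp = 0.0
for p in range(60):
    u_new = matvec(A, u)
    s = 1                       # watch the s-th element (here the largest one)
    lp = u_new[s] / u[s]        # l_p = u_p / u_{p-1}, element-wise, eq. (6.14)
    m = max(u_new, key=abs)     # rescale to avoid overflow
    u = [ui / m for ui in u_new]

# lp now approximates the dominant eigenvalue lambda_1 = 2 + sqrt(2)
```

The rescaling step anticipates the normalised form (6.16) introduced below; the two are equivalent.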
In deriving this result we have made two important assumptions:

(i) There is only one eigenvalue with largest modulus. The general case is outside the scope of this book (cf. Exs. 6.7, 6.8).

(ii) The trial vector u_0 is such that a_1 ≠ 0. For rapid convergence we should try to choose u_0 so that a_1 is as large as possible relative to the other a_i. The approximate form of u_0 is often known on physical grounds.

In numerical work, instead of working with the sequence u_p it is convenient to form the sequences z_p, w_p, starting from an arbitrary vector z_0 = w_0, by means of the equations

    A w_{p-1} = z_p = l_p w_p,  (p = 1, 2, 3, ...),    (6.16)

where l_p is now defined as the largest element of z_p, so that the largest element of w_p is unity. Suppose that the largest element of x_1 is the sth element and that x_1 is normalised so that this element is unity. A minor difficulty arises in the theoretical analysis because for small p the largest element of z_p (and hence the unit element of w_p) may not be the sth element, owing to the preponderance of components other than x_1 in the initial trial vector. But we know, from assumption (ii) above, that eventually the effect of x_1 will predominate. Hence without loss of generality we can assume that the sth element of z_p is always the largest element. We write

    w_0 = α_1 x_1 + α_2 x_2 + ... + α_n x_n,

where we denote the sth element of x_i (i > 1) by x_is, and the α_i are constants whose precise values are not important. From (6.16) we find

    w_p = [ x_1 + Σ_{i=2}^n α_i (λ_i/λ_1)^p x_i ] / [ 1 + Σ_{i=2}^n α_i (λ_i/λ_1)^p x_is ].    (6.17)

As before l_p tends to λ_1 as p tends to infinity. Also w_p tends to x_1.

If convergence is slow it will be desirable to use extrapolation. In order to simplify the analysis we make a third assumption:

    (iii) λ_2 ≠ -λ_3.    (6.18)

(We have already assumed |λ_2| ≥ |λ_3|. This new assumption imposes an additional restriction.) On using this assumption we have, for sufficiently large p,

    l_p ≈ λ_1 { 1 + k(λ_2/λ_1)^p },
where k = α_2 x_2s (λ_2 - λ_1)/λ_2. An improved approximation L_1 for λ_1 can be obtained by setting

    l_p = L_1 { 1 + k(λ_2/λ_1)^p }.    (6.19)

On replacing p by p - 1 and eliminating k we find

    L_1 - l_p = (λ_2/λ_1)(L_1 - l_{p-1}).    (6.20)

Hence in the terminology of § 2.4 we are dealing with a first-order iterative procedure. Convergence can be accelerated by using Aitken's δ²-process (2.15). (Compare (6.20) and (2.13).) An improved estimate of λ_1 is therefore given by

    L_1 = l_p - (l_p - l_{p-1})² / (l_p - 2l_{p-1} + l_{p-2}).    (6.21)

In exactly the same way (6.17) gives

    w_p ≈ x_1 + k'(λ_2/λ_1)^p,

where k' is now a constant vector. Hence convergence of the successive approximate eigenvectors can also be accelerated by applying Aitken's δ²-process element by element. But it must be remembered that the above analysis is not valid if λ_2 = -λ_3, and the extrapolation will not be very accurate if λ_2 is approximately equal to -λ_3.

In applying Aitken's formula it is essential that no numerical mistakes be made in computing the iterates used. For the special case when A is symmetrical a numerical check on successive iterates is given by the following formula derived from (6.16):

    l_{p+1} w'_{p-1} w_{p+1} = l_p w'_p w_p.    (6.22)

From (6.20) it is clear that the rate of convergence of the original first-order iterative procedure depends on λ_1/λ_2. The larger this ratio the faster the convergence. From (6.19) it is seen that

    λ_1/λ_2 ≈ (l_{p-1} - l_{p-2}) / (l_p - l_{p-1}) = t_p,  say.    (6.23)

The value of t_p can similarly be estimated from the components of w_{p-2}, w_{p-1}, w_p. In terms of t_p Aitken's formula (6.21) is (cf. (2.18b))

    L_1 = l_p + (l_p - l_{p-1}) / (t_p - 1).

In hand computation it is often convenient to compute t_p mentally after each iteration and extrapolate using Aitken's formula as soon as successive iterates begin to vary regularly, as indicated by agreement between successive values of t_p.
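The δ²-process (6.21) is easily mechanised. A sketch (Python, hypothetical; not from the text) generating the sequence l_p for the matrix (6.10) by the iteration (6.16) and then extrapolating with Aitken's formula:

```python
import math

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

def matvec(M, v):
    n = len(M)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

# Generate l_1 ... l_8 by the iteration (6.16).
w, ls = [0.0, 1.0, 0.0], []
for _ in range(8):
    z = matvec(A, w)
    l = max(z, key=abs)         # l_p: numerically largest element of z_p
    w = [zi / l for zi in z]    # largest element of w_p is now unity
    ls.append(l)

def aitken(l2, l1, l0):
    # L_1 = l_p - (l_p - l_{p-1})^2 / (l_p - 2 l_{p-1} + l_{p-2}),  eq. (6.21)
    return l2 - (l2 - l1) ** 2 / (l2 - 2.0 * l1 + l0)

L = aitken(ls[-1], ls[-2], ls[-3])
exact = 2.0 + math.sqrt(2.0)
```

With this trial vector the extrapolated value L is markedly closer to 2 + √2 than the last raw iterate l_8.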
If successive values of t_p show no signs of behaving regularly as p increases, this indicates that some of the basic assumptions (i)-(iii) in the above analysis are wrong (cf. Exs. 6.7, 6.8).

Ex. 6.2. Obtain the largest eigenvalue and the corresponding eigenvector of the following matrix, working to four decimal places:

        [ -0.030  -0.242  -0.603   0.178 ]
    A = [ -0.242   0.860  -0.343   0.393 ]    (6.24)
        [ -0.603  -0.343   1.350   0.251 ]
        [  0.178   0.393   0.251   2.630 ]

Iteration with w_0' = [1, 1, 1, 1] gives

    z_6 = 2.7386 [0.0180, 0.1808, 0.1349, 1.0000]',
    z_7 = 2.7381 [0.0191, 0.1818, 0.1316, 1.0000]',
    z_8 = 2.7379 [0.0198, 0.1825, 0.1296, 1.0000]'.

The numerical check (6.22) gives l_8 w'_6 w_8 ≈ 2.8770 and l_7 w'_7 w_7 ≈ 2.8770. The following values of t_3, ..., t_8 are found: -1.7, -2.2, 1.2, 1.1, 3.0, 2.5. The last two values are not very accurate due to round-off error. However, the values of t_p indicate that successive iterates are beginning to behave regularly when p = 8, and this can be confirmed by estimating t_p from the components of the eigenvectors. Since the round-off error is starting to become appreciable compared with the differences between successive iterates z_6, z_7, z_8 it is advisable to apply Aitken's δ²-process at this point. (If we were working to five or more decimal places we might iterate again to confirm the regular behaviour.) We obtain

    x_1 = [0.0198 + 0.0012, 0.1825 + 0.0016, 0.1296 - 0.0031, 1.0000]' = [0.0210, 0.1841, 0.1265, 1.0000]'.

A further two iterations give

    λ_1 = 2.7376,  x_1 = [0.0207, 0.1836, 0.1265, 1.0000]'.

Further iteration merely repeats these values, which therefore represent the final estimate of the solution.

§ 6.3. The determination of subdominant eigenvalues and eigenvectors. When the largest or dominant eigenvalue and the corresponding eigenvector of an n x n matrix have been obtained, various methods are available for replacing the matrix by an (n-1) x (n-1) matrix with eigenvalues λ_2, ..., λ_n. The largest eigenvalue and corresponding eigenvector of this new matrix can then be determined.
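The whole of Ex. 6.2 can be repeated in floating point rather than to four decimal places. A sketch (Python, hypothetical; the numerical target 2.7376 is the book's quoted answer):

```python
# Matrix (6.24) of Ex. 6.2; the book's final estimate is lambda_1 = 2.7376.
A = [[-0.030, -0.242, -0.603, 0.178],
     [-0.242, 0.860, -0.343, 0.393],
     [-0.603, -0.343, 1.350, 0.251],
     [0.178, 0.393, 0.251, 2.630]]

def matvec(M, v):
    n = len(M)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

w = [1.0, 1.0, 1.0, 1.0]   # trial vector of Ex. 6.2
lam = 0.0
for _ in range(100):
    z = matvec(A, w)
    lam = max(z, key=abs)
    w = [zi / lam for zi in z]

# residual of the eigenvalue equation A x = lambda x
residual = max(abs(zi - lam * wi) for zi, wi in zip(matvec(A, w), w))
```

The computed eigenvector agrees with [0.0207, 0.1836, 0.1265, 1.0000]' to the working accuracy of the text.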
The procedure can be repeated until all the eigenvalues and eigenvectors of the original matrix have been determined. We confine our attention to the case of a real symmetric matrix with real, distinct eigenvalues. The method described below applies only to symmetric matrices. It depends essentially on the orthogonality property of eigenvectors given by equation (6.7). The method is described sufficiently clearly by an example, and we give no general discussion. Other methods for removing known eigenvalues and eigenvectors from a matrix are described in Modern Computing Methods, H.M. Stationery Office, 2nd Edn. (1961), Chapter III.

Ex. 6.3. Suppose that the eigenvector corresponding to the second largest eigenvalue of the matrix used in Ex. 6.2 is given by x_2' = [x_1, x_2, x_3, x_4]. From the equation x_2'x_1 = 0, using the value found in Ex. 6.2 for x_1, we find

    x_4 = -0.0207 x_1 - 0.1836 x_2 - 0.1265 x_3.    (6.25)

If we write out in full the first linear equation of the matrix system Ax = λx, where A is the matrix of Ex. 6.2, we have

    -0.030 x_1 - 0.242 x_2 - 0.603 x_3 + 0.178 x_4 = λ x_1.

The unknown x_4 can be eliminated from this equation by means of (6.25). In the same way x_4 can be eliminated from the left-hand sides of the other three equations in Ax = λx, and finally we obtain

    -0.0337 x_1 - 0.2747 x_2 - 0.6255 x_3 = λ x_1,    (6.26a)
    -0.2501 x_1 + 0.7878 x_2 - 0.3927 x_3 = λ x_2,    (6.26b)
    -0.6082 x_1 - 0.3891 x_2 + 1.3182 x_3 = λ x_3,    (6.26c)
     0.1236 x_1 - 0.0899 x_2 - 0.0817 x_3 = λ x_4.    (6.26d)

The first three equations represent an eigenvalue problem for the 3 x 3 matrix

        [ -0.0337  -0.2747  -0.6255 ]
    B = [ -0.2501   0.7878  -0.3927 ]    (6.27)
        [ -0.6082  -0.3891   1.3182 ]

Suppose that the largest eigenvalue and the corresponding eigenvector of the matrix (6.27) have been found by the method of § 6.2 or otherwise. From the values of x_1, x_2, x_3 the value of x_4 can be found from (6.25), and the final values of the x_i can be checked from (6.26d), or vice-versa.
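An alternative way of removing a known eigenvalue from a symmetric matrix, not the book's order-reduction but what is commonly called Hotelling deflation, is to iterate with A' = A - λ_1 x_1 x_1'/(x_1'x_1), whose eigenvalues are those of A with λ_1 replaced by zero. A sketch (Python, hypothetical) on the 3 x 3 matrix (6.10), where the subdominant eigenvalue is exactly 2:

```python
import math

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
lam1 = 2.0 + math.sqrt(2.0)
x1 = [1.0, -math.sqrt(2.0), 1.0]      # eigenvector belonging to lambda_1

# Deflated matrix A' = A - lambda_1 x_1 x_1' / (x_1' x_1)
n = len(A)
s = sum(xi * xi for xi in x1)
Ad = [[A[i][j] - lam1 * x1[i] * x1[j] / s for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

# Power iteration on the deflated matrix now yields lambda_2 = 2.
w, lam2 = [1.0, 0.0, 0.0], 0.0
for _ in range(200):
    z = matvec(Ad, w)
    lam2 = max(z, key=abs)
    w = [zi / lam2 for zi in z]
```

The book's orthogonality method has the advantage of reducing the order of the matrix at each stage; deflation keeps the order fixed but requires no re-derivation of equations.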
In this way we obtain

    λ_2 = 1.6500,  x_2 = [-0.3119, -0.3650, 1.0000, -0.0530]'.

Suppose next that [y_1, y_2, y_3, y_4] denotes the transpose of one of the two remaining eigenvectors x_3 or x_4. This vector is orthogonal to x_1 and x_2, which are now known, so that

    -0.3119 y_1 - 0.3650 y_2 + 1.0000 y_3 - 0.0530 y_4 = 0,
     0.0207 y_1 + 0.1836 y_2 + 0.1265 y_3 + 1.0000 y_4 = 0.    (6.28)

Elimination of y_4 between these two equations gives

    y_3 = 0.3087 y_1 + 0.3529 y_2.    (6.29)

On recalling the method of derivation of (6.26) we see that (6.26) still holds if the x_i are replaced by y_i. If in (6.26a-c) we eliminate y_3 by means of (6.29) we obtain

    -0.2268 y_1 - 0.4954 y_2 = λ y_1,
    -0.3713 y_1 + 0.6492 y_2 = λ y_2,    (6.30)
    -0.2013 y_1 + 0.0761 y_2 = λ y_3.

The first two equations represent an eigenvalue problem for the 2 x 2 matrix

    C = [ -0.2268  -0.4954 ]
        [ -0.3713   0.6492 ]

The easiest method of procedure is now to solve the eigenvalue problem det(C - λI) = 0 by expansion of the determinant and direct solution of the resulting quadratic. Back-substitution in (6.29) and (6.28) gives the remaining components of the eigenvectors of the original matrix A, and (6.30) is a check. This gives, on normalising so that the largest elements are unity,

    λ_3 = 0.8241,  x_3 = [-0.4714, 1.0000, 0.2074, -0.2001]';
    λ_4 = -0.4017,  x_4 = [1.0000, 0.3533, 0.4334, -0.1741]'.

§ 6.4. The iterative solution of linear simultaneous algebraic equations. It is convenient to consider certain iterative methods for the solution of linear equations at this point, since the convergence of these procedures can be discussed by means of the theory of eigenvalues and eigenvectors. We start by stating two classical iterative procedures, considering for simplicity the case of three equations in three unknowns. It is assumed that the elements in the principal diagonal of the coefficient matrix are non-zero. Without loss of generality these can be taken to be unity.
(a) The equations for the method of simultaneous corrections are

    x_1^(r+1) = -a_12 x_2^(r) - a_13 x_3^(r) + b_1,
    x_2^(r+1) = -a_21 x_1^(r) - a_23 x_3^(r) + b_2,    (6.31)
    x_3^(r+1) = -a_31 x_1^(r) - a_32 x_2^(r) + b_3.

(b) The equations for the method of successive corrections are

    x_1^(r+1) = -a_12 x_2^(r) - a_13 x_3^(r) + b_1,
    x_2^(r+1) = -a_21 x_1^(r+1) - a_23 x_3^(r) + b_2,    (6.32)
    x_3^(r+1) = -a_31 x_1^(r+1) - a_32 x_2^(r+1) + b_3.

The new estimates of the x_i are used on the right as soon as they are obtained.

A general class of iterative procedures, which includes these methods as special cases, can be discussed in the following way. We solve the equations

    Ax = b,    (6.33)

where A is an n x n matrix, by splitting the matrix A in the form A = E - F. Then

    Ex = Fx + b,    (6.34)

and an iterative procedure is derived by writing

    E x^(r+1) = F x^(r) + b,    (6.35)

where x^(r) is the rth approximation to the exact value x. The iteration is started by setting x^(0) equal to an arbitrary column matrix. If it is possible to obtain an approximation to the exact solution x it may be convenient to take x^(0) equal to this estimate of x, but often it is sufficient to choose x^(0) = 0. The matrix E should be chosen so that it is easy to solve the set of equations (6.35) for x^(r+1), assuming that x^(r) is known. This implies that det E is non-zero, and E⁻¹ exists. We have seen in Chapter V that it is easy to solve sets of equations if the matrix of coefficients is of lower triangular form, and E is often chosen to be of this type. An example is the method of successive corrections quoted above. Instead of working in terms of the estimated values of the unknowns it is useful to introduce the corrections

    c^(r+1) = x^(r+1) - x^(r).    (6.36)

As in the discussion of other types of iterative procedure in Chapter II, if x^(r) tends to the required value x as r tends to infinity we say that the procedure is convergent. If the procedure is convergent,

    x = x^(0) + Σ_{r=1}^∞ c^(r).    (6.37)
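The two classical schemes (6.31) and (6.32), often known in later literature as the Jacobi and Gauss-Seidel methods, can be sketched as follows (Python, hypothetical; written for a general diagonal, dividing through by a_ii rather than assuming a unit diagonal, and tried on an illustrative system not from the text):

```python
def jacobi_step(A, b, x):
    # simultaneous corrections: every new value is computed from the old vector
    n = len(A)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(A, b, x):
    # successive corrections: new values are used on the right as soon as found
    n = len(A)
    x = list(x)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

# Illustrative diagonally dominant system with exact solution x = [1, 1, 1].
A = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
b = [6.0, 6.0, 6.0]

x = [0.0, 0.0, 0.0]
for _ in range(60):
    x = gauss_seidel_step(A, b, x)
```

Both steps have the form (6.35) with E the diagonal (for 6.31) or the lower triangle including the diagonal (for 6.32).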
A necessary condition for convergence of the infinite series in this equation is that c^(r) should tend to zero as r tends to infinity. On writing r - 1 in place of r in (6.35) and subtracting the result from (6.35) we see that

    c^(r+1) = K c^(r),  where K = E⁻¹F.    (6.38)

Hence

    c^(r+1) = K^{r+1} c^(0).    (6.39)

The convergence of the procedure therefore depends on the behaviour of K^r as r tends to infinity. This can be expressed in terms of the size of the eigenvalues of K by the following argument, which should be compared with the analysis at the beginning of § 6.2. We consider only the case where the matrix K possesses n linearly independent eigenvectors z_i with corresponding eigenvalues λ_i, so that

    K z_i = λ_i z_i,  or  (F - λ_i E) z_i = 0.    (6.40)

It is assumed that there is one eigenvalue of maximum modulus, say λ_1, where λ_1 is real. The other eigenvalues can be complex. From (6.9) the vector c^(0) can be written in the form

    c^(0) = a_1 z_1 + a_2 z_2 + ... + a_n z_n.    (6.41)

From (6.39), (6.40), (6.41),

    c^(r+1) = a_1 λ_1^{r+1} z_1 + a_2 λ_2^{r+1} z_2 + ... + a_n λ_n^{r+1} z_n.    (6.42)

For large r = R, since λ_1 is the eigenvalue of largest modulus,

    c^(R+1) ≈ a_1 λ_1^{R+1} z_1,  c^(R+s) ≈ λ_1^s c^(R).    (6.43)

If we perform R iterations and then estimate c^(R+s) for s > 0 by means of this result, equation (6.37) gives

    x ≈ x^(0) + Σ_{r=1}^R c^(r) + C*,    (6.44a)

where

    C* = Σ_{s=1}^∞ c^(R+s) ≈ c^(R) λ_1 / (1 - λ_1).    (6.44b)

We have assumed that |λ_1| < 1, and this is obviously a necessary condition for convergence of the procedure. This formula provides an estimate of the error at any stage of the iterative procedure, provided that there is one real eigenvalue λ_1 which is larger in modulus than the others and the approximate value of λ_1 is known. From (6.43) the value of λ_1 can be estimated from the ratio of corresponding elements of successive c^(r).
The consistency of the values of λ_1 obtained at various stages of the iteration is an essential check on the validity of the basic assumption that there is one eigenvalue which is larger in modulus than the others.

We define the rate of convergence ρ by the formula

    ρ = -ln |λ_1|,    (6.45)

where λ_1 is the eigenvalue of K with the largest modulus, as defined in the last paragraph. To illustrate the significance of ρ we introduce the error vector e^(r) = x^(r) - x. Subtraction of (6.33) from (6.35) gives

    e^(r+1) = K e^(r) = K^{r+1} e^(0).

As in the analysis leading to (6.43) we can show that for large r the error in any component of the rth iterate will be nearly equal to λ_1^r q, where q is some constant. The number of iterations r required to reduce the error in this component to, say, ε is given by

    |λ_1|^r |q| = ε.

Taking logarithms,

    r = ln(|q|/ε) / ln(1/|λ_1|) = ln(|q|/ε) / ρ.

Hence the number of iterations required is inversely proportional to the rate of convergence. The smaller the value of |λ_1| the greater the rate of convergence.

We can compare the rates of convergence of different iterative procedures for solving the same set of equations if we can compare the magnitudes of the eigenvalues of maximum modulus of the corresponding matrices. An instructive elementary discussion of rates of convergence can be given for the special case of a band or tridiagonal matrix, the elements of which are zero except on the principal and immediately adjacent diagonals. Without loss of generality we can assume that the principal diagonal elements are unity, so that we can set A = L + I + U, where I is the unit matrix and in the 3 x 3 case

        [ 1   u_1  0  ]        [ 0   0   0 ]        [ 0  u_1  0  ]
    A = [ l_1 1   u_2 ],   L = [ l_1 0   0 ],   U = [ 0  0   u_2 ].
        [ 0   l_2  1  ]        [ 0   l_2 0 ]        [ 0  0   0  ]

The methods of simultaneous and successive corrections, equations (6.31), (6.32), can be written, respectively,

    x^(r+1) = -(L + U) x^(r) + b,    (6.46)
    (L + I) x^(r+1) = -U x^(r) + b.    (6.47)
We examine a generalisation of (6.47) which, as explained later, is known as the method of successive over-corrections:

    (L + pI) x^(r+1) = -{(1 - p)I + U} x^(r) + b,    (6.48)

where p is a constant which will be determined so as to yield the fastest rate of convergence. The method of successive corrections is the special case p = 1. We shall compare the rates of convergence of (6.46) and (6.48). From (6.34), (6.40), (6.45) this means that we need to compare the eigenvalues λ, μ defined by the equations

    (6.46):  [(L + U) + λI] u = 0,    (6.49)
    (6.48):  [{(1 - p)I + U} + μ(L + pI)] v = 0.    (6.50)

By means of the following procedure we can show that λ and μ are directly related. Write (6.50) out in detail. Multiply the jth equation by α_j (j = 1, 2, 3) and set

    β = (1 - p) + pμ,  v_j = β_j V_j,  (j = 1, 2, 3),

where the V_j are new unknowns. The α_j, β_j are constants which are at our disposal. The equations (6.50) become

    β α_1 β_1 V_1 + u_1 α_1 β_2 V_2 = 0,
    μ l_1 α_2 β_1 V_1 + β α_2 β_2 V_2 + u_2 α_2 β_3 V_3 = 0,    (6.51)
    μ l_2 α_3 β_2 V_2 + β α_3 β_3 V_3 = 0.

We next identify these equations with (6.49). To make the terms off the principal diagonal agree, we require

    α_1 β_2 = α_2 β_3 = 1,  μ α_2 β_1 = μ α_3 β_2 = 1.

By means of these equations four of the six α_j, β_j can be expressed in terms of the remaining two. The way in which this is done is immaterial. Suppose that we use the remaining freedom to make the diagonal scalings equal, α_1β_1 = α_2β_2 = α_3β_3 = μ^{-1/2}, which requires α_1/α_2 = α_2/α_3 = μ^{1/2}. Then (6.51) becomes

    μ^{-1/2} β V_1 + u_1 V_2 = 0,
    l_1 V_1 + μ^{-1/2} β V_2 + u_2 V_3 = 0,
    l_2 V_2 + μ^{-1/2} β V_3 = 0.

These are identical with (6.49) if

    β = (1 - p) + pμ = μ^{1/2} λ.    (6.52)
If they converge, the rate of convergence is doubled if we use successive instead of simultaneous corrections. To determine the value of p which gives the fastest rate of convergence for any given '/. we discuss the dependence of the roots of the quadratic (6.52) for p i on p. When the roots are real we need to consider the root of larger modulus, namely I /«* I = (2/>)-{| X \ + Jtt 2 -4p+4p 2 )}, p l <P<P 2 , (6.53) where /7,,/j,are the roots of thequadratic4/> 2 -4/7 + ;. 2 = 0. ForO<|A|<l we hwe0<p,<±<p 2 <]. When the roots are complex | /i | = (/>-'- 1), p t <p<p 2 (6.54) By drawing graphs of | p | against p for various ;. it is easy to see that for convergence we need consider only p>\, and then | /< | decreases for \<p<p 2 , and increases for p>p 2 - Also the largest | p | corresponds to the largest | X |. Hence the rate of convergence is greatest when P-P2-HW0-*?)}. (6.55) where ). x is the root of maximum modulus for the method of simultaneous corrections. From (6.53), (6.54) the eigen- value of maximum modulus corresponding to the optimum p is given by /ll =(2 i ,)- 2 ;. 2 = p-'-i = {1-V(1-A,)}{1+V(1-A 2 )}- (6.56) §6.4 EIGENVALUES AND EIGENVECTORS 14S The value of the method can be illustrated by supposing that A, = 1-e where e is small so that the methods of successive and simultaneous corrections converge slowly. Then lit x {1- V /(2£)}{1 + 7(2 K )}-' * 1-27(26). If A, = 0-995, e = 0005, then /i,ss0-8 and a remarkable increase in the rate of convergence is obtained. We can rearrange (6.48) in the form jCr-M) = X (0 +ft) {_ tx (r + l)_ (/+y)x (r)_ fc}) ( g 5?) where <u = I /p. If /> = <y = 1 we have the method of successive corrections and the bracketed term on the right is the correction added to x (r) to give x (r+l) . From (6.55) since U, | < | we have ±</><l, or to>], so that in the general method the bracketed term is multiplied by a factor greater than unity. This is the reason for naming the procedure the method of successive over-corrections. 
The following example illustrates numerically some aspects of the iterative procedures described in this section. An application of these methods is given in Ex. 11.9.

Ex. 6.4. Obtain the results given in Table 6.1 for the iterative solution of the following equations, working to five decimal places:

    20x_1 -  8x_2               =  6,
    -4x_1 + 20x_2 -  4x_3       =  6,    (6.58)
            -4x_2 + 19x_3 - 4x_4 = 5,
                    -8x_3 + 20x_4 = 1.

In all cases the iteration was started with the estimates 20x_1^(0) = 6, 20x_2^(0) = 6, 19x_3^(0) = 5, 20x_4^(0) = 1. The eigenvalue of maximum modulus was estimated from

    λ_1 ≈ { Σ_{i=1}^4 c_i^(r+1) } / { Σ_{i=1}^4 c_i^(r) },  r = 3, 4.

For the methods of simultaneous and successive corrections the results in Table 6.1 give λ_1 ≈ 0.4 and λ_1 ≈ 0.2 respectively, though the iteration in the latter case has not proceeded far enough to give an accurate estimate.

Table 6.1. Results for the iterative solution of (6.58). All figures have been multiplied by 10^5.

    Method                      i   c^(3)  c^(4)  c^(5)   C*     x_i    Estimated error
    Simult. correct. (6.31)     1   1550    727    239    160   49181   -30
                                2   1818    597    299    200   48051   +24
                                3   1436    769    247    165   40889   -37
                                4   1835    574    308    206   21397   +27
    Success. correct. (6.32)    1   1257    400     74     18   49214    +3
                                2    999    185     31      8   48028    +1
                                3    525     83     13      3   40925    -1
                                4    210     33      5      1   21369    -1
    Success. over-correct.      1    918    297     15      1   49211     0
    (6.57), ω = 1.044           2    809     68      4      -   48028    +1
                                3    197     17      1      -   40926     0
                                4     47      5      -      -   21370     0

The C* in Table 6.1 are obtained from (6.44b). For the method of successive over-corrections we use, from (6.55) with λ_1 = 0.4,

    ω = 1/p = 2 { 1 + √0.84 }^{-1} ≈ 1.044.

In practice, of course, when ω is estimated we should not start from the beginning again with the original x^(0) as we have done in Table 6.1, but should start from the currently available estimates of the x_i.

In computing by hand, since the result is obtained by adding corrections, and the accuracy of the extrapolation also depends on the accuracy of the corrections, it is essential to apply checks.
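The over-correction scheme of Ex. 6.4 can be sketched directly (Python, hypothetical), using the starting values and the factor ω = 1.044 of the text:

```python
# System (6.58) of Ex. 6.4.
A = [[20.0, -8.0, 0.0, 0.0],
     [-4.0, 20.0, -4.0, 0.0],
     [0.0, -4.0, 19.0, -4.0],
     [0.0, 0.0, -8.0, 20.0]]
b = [6.0, 6.0, 5.0, 1.0]

def over_correct(A, b, omega, tol=1e-12, max_iter=200):
    n = len(A)
    x = [b[i] / A[i][i] for i in range(n)]   # starting values of Ex. 6.4
    for _ in range(max_iter):
        biggest = 0.0
        for i in range(n):
            # the successive-corrections value, then the over-correction (6.57)
            gs = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            new = x[i] + omega * (gs - x[i])
            biggest = max(biggest, abs(new - x[i]))
            x[i] = new
        if biggest < tol:
            break
    return x

x = over_correct(A, b, 1.044)
residual = max(abs(sum(A[i][j] * x[j] for j in range(4)) - b[i]) for i in range(4))
```

The converged solution agrees with the final column of Table 6.1, x ≈ (0.49211, 0.48027, 0.40926, 0.21370).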
Consider the method of successive corrections applied to a 3 x 3 set of equations as in (6.32). On adding the equations and subtracting from the result the corresponding equations with r - 1 in place of r, we have

    (1 + a_21 + a_31) c_1^(r+1) + (1 + a_32) c_2^(r+1) + c_3^(r+1) = -a_12 c_2^(r) - (a_13 + a_23) c_3^(r),

where c_i^(r) is the ith component of c^(r). A final check should also be applied by carrying out one iteration with the final estimates of the x_i.

EXAMPLES VI

Ex. 6.5. Show that the sum of the eigenvalues of A equals the sum of the diagonal elements of A, and that the product of the eigenvalues is equal to det A. If the matrix A has distinct non-zero eigenvalues λ_i, show that A⁻¹, A² have eigenvalues λ_i⁻¹, λ_i² respectively, and that A, A⁻¹, A² have the same eigenvectors.

Ex. 6.6. Prove the following statements. The matrices A - pI and A have the same eigenvectors. If η is an eigenvalue of A - pI then p + η is an eigenvalue of A. To find the largest eigenvalue of A by the first-order iterative procedure of § 6.2 it is in general quicker to iterate with A - pI rather than A, for some suitable choice of p. Suppose that λ_1 is positive and that the eigenvalues are arranged in the order λ_1 > λ_2 ≥ ... ≥ λ_n, with |λ_1| > |λ_2| and |λ_1| > |λ_n|. The fastest rate of convergence is obtained when p is chosen equal to ½(λ_2 + λ_n). For this choice of p assumption (iii) of § 6.2 is no longer valid, and if it is desired to use Aitken's δ²-process it may be better to choose p = ½(λ_3 + λ_n). The difficulty in applying these results is of course that λ_2, λ_3, λ_n are usually not known. When using an automatic computer it may be worth while to find a suitable value of p empirically. If it is known that all the roots are positive then a simple choice for p is ½λ_1*, where λ_1* is an approximate value for λ_1 found by iteration, using A.
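The shifting device of Ex. 6.6 can be sketched numerically (Python, hypothetical) on the matrix (6.10), whose eigenvalues 2 - √2, 2, 2 + √2 are known. Iterating with A - pI, p = ½(λ_1 + λ_2), makes the smallest eigenvalue of A dominant for the shifted matrix, and the shift is then undone:

```python
import math

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
p = 0.5 * ((2.0 + math.sqrt(2.0)) + 2.0)   # p = (lambda_1 + lambda_{n-1}) / 2

n = len(A)
As = [[A[i][j] - (p if i == j else 0.0) for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

w, eta = [1.0, 0.0, 0.0], 0.0
for _ in range(200):
    z = matvec(As, w)
    eta = max(z, key=abs)       # dominant eigenvalue of A - pI
    w = [zi / eta for zi in z]

lam_min = p + eta               # eigenvalue of A recovered from the shift
```

Here eta converges to -(something near 2.121), so lam_min recovers 2 - √2, the smallest eigenvalue of A.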
The above device can be used to find the smallest eigenvalue λ_n of A by choosing p = ½(λ_1 + λ_{n-1}) or ½(λ_1 + λ_{n-2}). This procedure may not be very satisfactory if the lowest roots are close together, which is often the case. Illustrate these remarks by discussing the case of a fourth-order matrix with eigenvalues 4, 3, 2, 1.

Ex. 6.7. Show that if λ_1 = -λ_2 then, using the same notation as in § 6.2,

    u_{2p+1} ≈ λ_1^{2p+1} (a_1 x_1 - a_2 x_2).

Deduce that if u_p is any element of u_p then u_p/u_{p-2} tends to λ_1² as p tends to infinity, and show how x_1 and x_2 can be found. Show that if |λ_1| > |λ_2| and λ_2 = -λ_3, so that assumption (iii) of § 6.2 is not true, Aitken's δ²-process can be applied to even- or odd-order iterates, but not to any three successive iterates.

Ex. 6.8. If the dominant eigenvalues of A are complex conjugate, suppose that they are the roots of λ² + αλ + β = 0. Show that for sufficiently large p

    u_p + α u_{p-1} + β u_{p-2} ≈ 0.

Deduce a method for finding α, β, and hence the dominant eigenvalues of A.

Ex. 6.9. In order to understand the emphasis on symmetrical matrices in this chapter we indicate how the theory of orthogonality of eigenvectors must be modified when A is unsymmetrical. Proofs of the following statements are left to the reader. A and its transpose A' have the same eigenvalues λ_i, and we assume that these are distinct. Corresponding to each λ_i there are eigenvectors x_i of A and y_i of A', and in general these are different:

    A x_i = λ_i x_i,  A' y_i = λ_i y_i.

In general the x_i are not orthogonal to each other, nor are the y_i, but

    y_i' x_j = 0,  (i ≠ j).

We say that the x_i and y_i are biorthogonal. An expansion of the form (6.9) exists for an arbitrary vector. Starting from arbitrary vectors u_0, v_0 we can form two sequences

    u_p = A u_{p-1},  v_p = A' v_{p-1}.

If u_p, v_p are any elements of u_p, v_p then

    lim_{p→∞} u_p/u_{p-1} = λ_1,  lim_{p→∞} v_p/v_{p-1} = λ_1.

Further discussion of the unsymmetrical case lies outside the scope of this book.

Ex. 6.10.
Show by doing the necessary numerical work that when finding the dominant eigenvalue of the matrix (6.10) by iterating with u_0' = [1, 1, 1] the rate of convergence is more rapid than would be expected from the ratio λ_2/λ_1. Show that if we use u_0' = [1, 0, 0] the rate of convergence of the eigenvectors is given by λ_2/λ_1 but the rate of convergence of the eigenvalue is again more rapid than might be expected. Explain the reasons for these results.

Ex. 6.11. Show that the eigenvalues of the matrix

        [  0      1      0    ...   0   ]
    A = [  0      0      1    ...   0   ]
        [ ...    ...    ...   ...   1   ]
        [ -a_n  -a_{n-1} ...  -a_2 -a_1 ]

are identical with the roots of the algebraic equation

    λ^n + a_1 λ^{n-1} + ... + a_n = 0.

(Many of the remarks made in previous chapters in connection with finding the roots of algebraic equations apply to the determination of eigenvalues.)

Ex. 6.12. Show that if x is a first-order approximation to the eigenvector x_i of A, so that x = x_i + εr where ε is small, then

    L = x'Ax / x'x = λ_i - Cε² + higher-order terms,

so that L is a second-order approximation to λ_i. The quantity L is called the Rayleigh quotient.

Ex. 6.13. Write computer programs for finding the largest eigenvalue and the corresponding eigenvector of a matrix A using the following methods:

(a) Iterate using (6.16) until successive estimates of λ_1 and the components of x_1 differ by less than a quantity ε specified in a "Read" statement. If the procedure iterates more than 30 times without achieving convergence, print out the last two sets of results and stop the machine.
Check that the new value of t p is greater than the old value. Otherwise continue iterating with,*. (d) Use the method in (a) with the following addition. Compute the quantity t p at each stage (p^3). If t p and t p _ j differ by less than 0- 1 t p apply Aitken's <5 2 -process and start the iterative procedure again. Discuss precautions that could be taken to avoid harmful effects due to the noise-level of the calculation in cases (a), (c), (d). (The fact that the noise-level can upset an orthodox iteration is one of the reasons why we may prefer to use the noise-level itself as the criterion for con- vergence, as in (b).) Ex. 6.14. Iterative methods for the solution of Ax = b are particularly useful if A is diagonally dominant i.e. if in any row of the matrix the diagonal element is greater than the sum of the absolute values of the non-diagonal elements. Consider the method of simultaneous correc- tions. Without loss of generality we can take the diagonal elements to be unity. If we define °i *• E I "» l« a = max a " j i where the sum is taken over all non-diagonal elements only, then for diagonal dominance a<\. Let e (r> (/) denote the /th element of the error vector c (r) = x-x (r) . If we define 152 NUMERICAL METHODS I § 6.4 E, to be the maximum value of e (r, (/) for / = I, 2, ... n then l« w coi»iz«u«^ i, co|gB,-i2;ifl U i^B r . 1 . Hence E r ^,oE r - x ^o'E . Hence the method of simultaneous corrections converges for diagonally dominant matrices and for such matrices the modulus of the largest eigenvalue is less than a. Ex. 6.15. If X is an approximate inverse of A, show that if we define X,+ , = X,(2I-AX r ), r = 0,1,2,..., then as r tends to infinity X r tends to A " '. The procedure is second-order. This is an analogue, for matrices, of an iterative procedure for finding the inverse of a scalar, x = \/a. (Set n = -I in the first formula in Ex. 2.10.) A necessary condition for convergence of the scalar pro- cedure is that | 1 -ax | < 1. 
The analogue for the matrix procedure is that the largest eigenvalue of the matrix I - AX₀ must be less than unity in modulus.

Ex. 6.16. Write a computer program for the iterative solution of the equations Ax = b using the method of successive corrections. Start with the approximation xᵢ⁽⁰⁾ = bᵢ/aᵢᵢ. At each stage of the iteration estimate the eigenvalue of maximum modulus by means of the formula

    λ₁⁽ʳ⁺¹⁾ = cᵢ⁽ʳ⁺¹⁾ / cᵢ⁽ʳ⁾

(cᵢ⁽ʳ⁾ is the ith component of c⁽ʳ⁾ defined in (6.36). This formula can be justified by (6.43).) When successive estimates λ₁⁽ʳ⁾, λ₁⁽ʳ⁺¹⁾ are such that

    |λ₁⁽ʳ⁺¹⁾ - λ₁⁽ʳ⁾| < δ|λ₁⁽ʳ⁺¹⁾|,

where δ is a given number, say 0.05, adopt λ₁⁽ʳ⁺¹⁾ as the required estimate of the eigenvalue of maximum modulus λ₁. Correct the estimate of x by means of (6.44). Iterate four times, starting from the corrected estimate of x, and again apply (6.44). (Do not re-estimate λ₁.) Repeat this procedure until the maximum element of the correction vector C defined in (6.44b) is less than a specified number ε. What provision would you make for non-convergence of the procedure and for non-validity of the correction formula (6.44)? What information would you print out if the procedure does not converge, to indicate why the method has failed? (Ex. 11.9 is concerned with a computer program for the method of successive over-corrections.)
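The second-order inverse iteration of Ex. 6.15 can be sketched as below, again in pure Python on a small illustrative matrix. The starting guess X₀ = A′/trace(AA′) is an assumption of this sketch, not taken from the text; it is a standard choice that places the eigenvalues of I − AX₀ inside the unit interval for a nonsingular A.

```python
def mat_mul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse_iteration(A, X0, iterations=20):
    """Iterate X <- X (2I - A X); converges to the inverse of A when the
    eigenvalues of I - A X0 are all less than unity in modulus (Ex. 6.15)."""
    n = len(A)
    two_I = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    X = [row[:] for row in X0]
    for _ in range(iterations):
        X = mat_mul(X, mat_sub(two_I, mat_mul(A, X)))
    return X

A = [[2.0, 1.0],
     [1.0, 3.0]]
# Starting guess: transpose of A divided by the sum of squares of its
# elements (an assumed, safe choice; the text leaves X0 unspecified).
t = sum(A[i][j] ** 2 for i in range(2) for j in range(2))
X0 = [[A[j][i] / t for j in range(2)] for i in range(2)]
X = inverse_iteration(A, X0)
print(X)   # approximately [[0.6, -0.2], [-0.2, 0.4]], the exact inverse
```

Because the procedure is second-order, the number of correct figures roughly doubles at each step, so twenty iterations are far more than enough for a well-started 2×2 example.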
INDEX

Acceleration of convergence, 34, 133
Accuracy, 1-18, 41, 95-6, 117 (see also Errors)
Aitken's δ²-process, 33-6, 133
Algebraic equations, 19-53, 78
Automatic computer (see Computer, automatic)
Auxiliary matrix, 109
Back-substitution, 85, 109, 114
Bairstow's method, vii, 46-9
Band matrix, 119, 142
Biorthogonality, 130, 149
Characteristic roots (see Eigenvalues)
Chebyshev series, 15
Checks, 16, 23, 82-5, 107, 133, 147
Column matrix (vector), 103
Compact elimination method, 105-112
Complex roots of polynomials, 46-9
  complex eigenvalues, 148
Computer, automatic, v, 54-5
  and hand computing, 75-8
  programming, 54-79
  programs (see Programs)
Condition, of algebraic equations, 41-6
  of linear equations, 92-8, 117-18
Convergence of iteration, 21
  acceleration of, 34, 133
  criterion, 21, 65, 72
  order of, 31
  rate of, 26, 28, 141
Cramer's rule, 80, 124
Curve fitting, 120
δ²-process, 33-6, 133
Dependence of vectors, 128
Desk machine calculation, 11-15
  and automatic computing, 75-8
  square roots, 30
Determinants, 80, 88, 93, 94, 112, 120, 124, 128
Diagonal matrix, 104
Diagonally dominant matrix, 151
Digital computer (see Computer, automatic)
Direct and iterative methods, 81
Eigenvalues and eigenvectors, 122-153
  calculation of largest, 129-135
  calculation of subdominant, 135-8
  properties, 126-8, 147-150
Equations (see Algebraic or Linear equations)
Error, viii, 1-18 (see also Round-off, Ill-conditioning)
  absolute and relative, 3
  inherent, in linear equations, 92-8, 117-18
  in iterative procedures, 31-7
  in iterative solution of linear equations, 140
  in polynomial equations, 41-6
  in sums and products, 3-5
Expansion in orthogonal vectors, 128
Extrapolation, 33-6, 133, 140
False position, 51
Fixed and floating point, 6, 18
Flow-diagram, 62, 64, 74, 100
Hand-computing (see also Checks)
  compared with automatic, 75-8
  desk machine calculation, 11-15
Hermitian matrix, 130
Ill-conditioning, algebraic equations, 27, 42-6
  linear equations, 92-8, 117-18
Inherent error, 92 (see also Ill-conditioning)
Inner product, 127 (see also Scalar product)
Interpolation, linear, 12
Inverse matrix, 112-18
  computation of, 113-15
  iterative method for computing, 152
  left and right, 116
Iterative methods, 19-53
  convergence and divergence, 21
  for inverse matrix, 152
  for largest eigenvalue, 129-135
  for linear equations, 81, 138-147
  order of, 31
  programs for, 65-75
Latent roots (see Eigenvalues)
Least squares, 120
Linear dependence and independence, 128
Linear equations, 80-101, 105-118, 121
  compact method, 105-112
  errors, 92-8, 117-18
  iterative methods, 138-147
  successive elimination, 82-7
Linear interpolation, 12
Matrix, 102-122
  band or tridiagonal, 119
  eigenvalues of, 125
  inverse, 112-18
  left and right inverse, 116
  singular and nonsingular, 113
Memory, 54
Mistakes, 15-17, 23 (see also Checks)
Newton-Raphson iterative procedure, 26-30, 70-5
Noise-level, x, 10, 66, 72, 151
Nonsingular matrix, 113
Normalisation of vector, 128
Operator methods, viii
Order, of iterative procedure, 31
Orthogonal vectors, 127-8
Pivots, 87, 115
Polynomial equations, 38-50
Polynomials, evaluation of, 13-15
Principal diagonal, 105
Programming, 54-79
Program, 55
  for eigenvalues, 150
  for iterative procedures, 65-75, 78-9
  for iterative solution of linear equations, 153
  for method of successive elimination, 98-100, 119
  for roots of polynomials, 72-3
Proper values (see Eigenvalues)
Rate of convergence, 26, 28, 141 (see also Convergence)
Rayleigh quotient, 150
Residual, 42, 75, 94
Round-off, 2-10
  noise (see Noise-level)
Row matrix (vector), 103
Scalar product, 103 (see also Inner product)
Scaling, 6, 82, 87, 116
Significant figures, 2
Simultaneous equations (see Linear equations)
Singular matrix, 113
Square root, 30, 66
Square root method for linear equations, 119
Statement, 56
Statistical theory of round-off error, 9
Successive corrections, method of, 138
Successive elimination, 82-7
Symmetrical matrix, 105
tₚ, definition of, 36
Transcendental equation, 19-38, 50-1, 78
Transpose of matrix, 105
Triangular matrix, 105, 142
Tridiagonal matrix, 119, 142
Unit matrix, 105
Vector, 103
  expansion in terms of orthogonal vectors, 128

UNIVERSITY MATHEMATICAL TEXTS

Numerical Methods: 1. Iteration, Programming and Algebraic Equations

It is desirable that students of applied mathematics, the physical sciences, and engineering should be given an introduction to numerical analysis. The aim of this new title in the University Mathematical Texts series is to provide an elementary exposition of certain basic ideas. This has been done by considering a strictly limited selection of typical methods with particular emphasis on fundamentals, including error analysis, error estimation, pitfalls and the detection of mistakes. An effort has been made to avoid the common practice of regarding numerical methods as a collection of algorithms or recipes. Since the advent of automatic computers has produced a revolution in computing, programming for computers has been made an integral part of the exposition.

Volume 1 of this title deals with accuracy and error, iterative methods, algebraic equations, matrix methods and eigenvalues. Volume 2 deals with numerical methods for solving ordinary and partial differential equations.

Net price 10s 6d